Group 4 Project Assignment: NYC Flights
Introduction
In recent years, flight delays have cost the airline industry millions of dollars and have become a recurring problem. Therefore, it is essential to understand the behavior of flight delays.
This report’s objective is to analyse the data set (NYC Flights) and offer suggestions for reducing the departure delays. Time, weather, season, carrier, aircraft, and many other factors might cause a flight departure delay. To understand this, we have examined the relationship between departure delay and a range of other characteristics in the data set.
The NYC Flights data sets provided include flight information for three New York City airports: John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR), and LaGuardia Airport (LGA). They also contain data on the weather, airports, airlines, and planes. The main goal here is to reduce departure delays.
1. Description of the problem
As a team, we are working for a business analytics consulting company. The Port Authority of New York and New Jersey (PANYNJ) approached our company and asked us to analyze historical data to understand general trends in flight patterns and airport performance in NYC and to examine issues related to departure delays. The sections below detail the tasks in our analysis.
2. Assumptions
1. All three airports in question [EWR, JFK, LGA] follow similar air traffic regulations, security policies, boarding policies and baggage handling systems.
2. We assume that the time required for an individual flight to take off from or touch down on the runway is the same at all three airports.
3. We only take into account flights with a positive departure delay. Flights with negative values (early departures) are not considered.
4. We assume that flights are cancelled only because of severe weather conditions or irreparable operational or technical issues with the aircraft.
5. When calculating the average departure delay and the number of delayed flights, we exclude observations at or above the 99th percentile of departure delay (the largest 1%) in order to filter out extremely high values.
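Assumptions 3 and 5 can be sketched on a small example. The delay values below are made up for illustration and are not taken from flights.csv; the variable names `cutoff` and `kept` are likewise only illustrative.

```r
# Toy departure delays in minutes (made-up values, not from flights.csv)
dep_delay <- c(-5, -2, 0, 3, 8, 12, 20, 45, 90, 600)

# Assumption 5: cutoff at the 99th percentile of all observations
cutoff <- quantile(dep_delay, 0.99, names = FALSE)

# Assumption 3: keep only late departures, then drop the extreme tail
kept <- dep_delay[dep_delay > 0 & dep_delay < cutoff]

mean_kept <- mean(kept)
print(cutoff)
print(kept)
print(mean_kept)
```

With these toy values, the single extreme delay of 600 minutes is dropped, along with the early and on-time departures.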
3. Potential data issues in the data set
1. Bias in the analysis could occur due to the lack of flight data from 12 am to 5 am.
2. The data set does not contain certain significant factors that would help explain flight delays, particularly delays caused by air traffic congestion, boarding or airline problems, etc.
3. Collinearity exists between several variables in the data sets provided, particularly among the weather variables, which could lead to inaccurate insights from the analysis.
4. The insights drawn from the analysis may not reflect the present-day conditions that contribute to departure delays at the NYC airports, because the data set covers 2013.
5. Because several variables in the data sets have skewed distributions, their standard deviations could be highly inflated, making the standard deviation a poor measure of variability for those variables.
6. The exploratory analysis does not address the underlying cause of departure delays; instead, it concentrates on visualizing the key characteristics of the data sets.
7. The data set does not provide adequate information on the distinct categories of flights, such as commercial passenger flights, freight aircraft or private jets.
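Issue 5 can be illustrated with a small sketch. The numbers are made up: nine ordinary delays and one extreme value. The sketch compares the standard deviation with two more robust measures of spread (`IQR()` and `mad()` from base R), with and without the extreme value.

```r
# Made-up right-skewed delays in minutes: mostly small, one extreme value
delays <- c(1, 2, 2, 3, 4, 5, 6, 8, 10, 400)

# Spread measured three ways, with and without the extreme observation
spread_with    <- c(sd = sd(delays),      iqr = IQR(delays),      mad = mad(delays))
spread_without <- c(sd = sd(delays[-10]), iqr = IQR(delays[-10]), mad = mad(delays[-10]))

print(round(rbind(spread_with, spread_without), 2))
```

In this toy case the standard deviation changes by more than an order of magnitude when one value is removed, while the interquartile range and the median absolute deviation move far less.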
4. Objectives
This report aims to answer the following questions:
Question 1 - Is there a pattern to the departure delay in terms of time? (Month, Day of week and Hour)
Question 2 - How does weather impact flights from NYC? What is the effect of weather on departure delay?
Question 3 - How do departure delays differ by airport and carrier? Which airport and carrier are the best and the worst?
Question 4 - What is the impact of plane manufacturer and structure of the aircraft on departure delays?
Question 5 - Is there a pattern to the departure delay in terms of geography?
Exploratory Data Analysis
1. Setting up the environment and loading the libraries
library(tidyverse)
library(dplyr)
library(xray)
library(ggplot2)
library(lubridate)
library(corrgram)
library(corrplot)
2. Reading the data sets
flights <- read.csv(file = "flights.csv")
airlines <- read.csv(file = "airlines.csv")
planes <- read.csv(file = "planes.csv")
airports <- read.csv(file = "airports.csv")
weather <- read.csv(file = "weather.csv")
Data set overview
Image source: http://bigdatasummerinst.sph.umich.edu/wiki2019/images/6/63/Bdsi_2019_r_practice_dplyr_nycflights_answers.pdf
The data we work on consist of five CSV files that contain the following variables:
airlines.csv - Airline carrier codes and full carrier names
airports.csv - Airport metadata
faa - FAA airport code
name - usual name of the airport
lat, lon - location of the airport (latitude, longitude)
alt - altitude (in feet)
tz - time zone offset from GMT
dst - daylight saving time zone
tzone - IANA time zone, as determined by the GeoNames web service
flights.csv - On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013
year, month, day - date of departure
dep_time, arr_time - actual departure and arrival times (format HHMM or HMM), local time zone
sched_dep_time, sched_arr_time - scheduled departure and arrival times (format HHMM or HMM), local time zone
dep_delay, arr_delay - departure and arrival delays, in minutes. Negative times represent early departures/arrivals
carrier - two-letter carrier abbreviation
flight - flight number
tailnum - plane tail number
origin, dest - origin and destination
air_time - amount of time spent in the air, in minutes
distance - distance between airports, in miles
hour, minute - time of scheduled departure broken into hour and minutes
time_hour - scheduled date and hour of the flight as a date
planes.csv - Plane metadata for all plane tailnumbers found in the FAA aircraft registry.
tailnum - Tail number
year - Year manufactured.
type - Type of plane.
manufacturer, model - Manufacturer and model.
engines, seats - Number of engines and seats.
speed - Average cruising speed in mph.
engine - Type of engine.
weather.csv - Hourly meteorological data for LGA, JFK and EWR
origin - weather station location
year, month, day, hour - time of recording
temp, dewp - temperature and dew point in F
humid - relative humidity
wind_dir, wind_speed, wind_gust - wind direction (in degrees), speed and gust speed (in mph)
precip - precipitation, in inches
pressure - sea level pressure in millibars
visib - visibility in miles
time_hour - date and hour of the weather station recording as a POSIXct date
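The descriptions list time_hour as a date, but read.csv() loads it as a character column (see the glimpse() output below). A minimal sketch of one way to convert it, using base R only, is shown here. The two strings are illustrative values in the same format, and the variable names are hypothetical. The sketch assumes the trailing "Z" denotes UTC, which is consistent with a scheduled 05:15 local departure being stored as 10:00Z in January.

```r
# Example strings in the format shown by glimpse(); the values are illustrative
time_hour_chr <- c("2013-01-01T10:00:00Z", "2013-07-01T09:00:00Z")

# Assumption: the trailing "Z" marks UTC, so parse as UTC first
time_hour_utc <- as.POSIXct(time_hour_chr,
                            format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Display the same instants in New York local time
time_hour_ny <- format(time_hour_utc, tz = "America/New_York", usetz = TRUE)
print(time_hour_ny)
```

Parsing in UTC and converting only for display avoids ambiguity around daylight saving time changes.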
3. Data skimming
glimpse(flights)
## Rows: 327,346
## Columns: 20
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <int> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <int> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-0…
summary(flights)
## ID year month day
## Min. : 1 Min. :2013 Min. : 1.000 Min. : 1.00
## 1st Qu.: 81837 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Median :163674 Median :2013 Median : 7.000 Median :16.00
## Mean :163674 Mean :2013 Mean : 6.565 Mean :15.74
## 3rd Qu.:245510 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :327346 Max. :2013 Max. :12.000 Max. :31.00
## dep_time sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 1 Min. : 500 Min. : -43.00 Min. : 1 Min. : 1
## 1st Qu.: 907 1st Qu.: 905 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1122
## Median :1400 Median :1355 Median : -2.00 Median :1535 Median :1554
## Mean :1349 Mean :1340 Mean : 12.56 Mean :1502 Mean :1533
## 3rd Qu.:1744 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1944
## Max. :2400 Max. :2359 Max. :1301.00 Max. :2400 Max. :2359
## arr_delay carrier flight tailnum
## Min. : -86.000 Length:327346 Min. : 1 Length:327346
## 1st Qu.: -17.000 Class :character 1st Qu.: 544 Class :character
## Median : -5.000 Mode :character Median :1467 Mode :character
## Mean : 6.895 Mean :1943
## 3rd Qu.: 14.000 3rd Qu.:3412
## Max. :1272.000 Max. :8500
## origin dest air_time distance
## Length:327346 Length:327346 Min. : 20.0 Min. : 80
## Class :character Class :character 1st Qu.: 82.0 1st Qu.: 509
## Mode :character Mode :character Median :129.0 Median : 888
## Mean :150.7 Mean :1048
## 3rd Qu.:192.0 3rd Qu.:1389
## Max. :695.0 Max. :4983
## hour minute time_hour
## Min. : 5.00 Min. : 0.00 Length:327346
## 1st Qu.: 9.00 1st Qu.: 8.00 Class :character
## Median :13.00 Median :29.00 Mode :character
## Mean :13.14 Mean :26.23
## 3rd Qu.:17.00 3rd Qu.:44.00
## Max. :23.00 Max. :59.00
anomalies(flights)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 minute 327346 0 - 58924 18% 0 - 0 - 60
## 2 dep_delay 327346 0 - 16466 5.03% 0 - 0 - 526
## 3 arr_delay 327346 0 - 5409 1.65% 0 - 0 - 577
## 4 year 327346 0 - 0 - 0 - 0 - 1
## 5 origin 327346 0 - 0 - 0 - 0 - 3
## 6 month 327346 0 - 0 - 0 - 0 - 12
## 7 carrier 327346 0 - 0 - 0 - 0 - 16
## 8 hour 327346 0 - 0 - 0 - 0 - 19
## 9 day 327346 0 - 0 - 0 - 0 - 31
## 10 dest 327346 0 - 0 - 0 - 0 - 104
## 11 distance 327346 0 - 0 - 0 - 0 - 213
## 12 air_time 327346 0 - 0 - 0 - 0 - 509
## 13 sched_dep_time 327346 0 - 0 - 0 - 0 - 1020
## 14 sched_arr_time 327346 0 - 0 - 0 - 0 - 1162
## 15 dep_time 327346 0 - 0 - 0 - 0 - 1317
## 16 arr_time 327346 0 - 0 - 0 - 0 - 1410
## 17 flight 327346 0 - 0 - 0 - 0 - 3835
## 18 tailnum 327346 0 - 0 - 0 - 0 - 4037
## 19 time_hour 327346 0 - 0 - 0 - 0 - 6922
## 20 ID 327346 0 - 0 - 0 - 0 - 327346
## type anomalous_percent
## 1 Integer 18%
## 2 Integer 5.03%
## 3 Integer 1.65%
## 4 Integer -
## 5 Character -
## 6 Integer -
## 7 Character -
## 8 Integer -
## 9 Integer -
## 10 Character -
## 11 Integer -
## 12 Integer -
## 13 Integer -
## 14 Integer -
## 15 Integer -
## 16 Integer -
## 17 Integer -
## 18 Character -
## 19 Character -
## 20 Integer -
##
## $problem_variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct type
## 1 year 327346 0 - 0 - 0 - 0 - 1 Integer
## anomalous_percent problems
## 1 - Less than 2 distinct values.
distributions(flights)
## ================================================================================
## Variable p_1 p_10 p_25 p_50 p_75 p_90
## 1 minute 0 0 8 29 44 55
## 2 dep_delay -12 -7 -5 -2 11 49
## 3 arr_delay -44 -26 -17 -5 14 52
## 4 year 2013 2013 2013 2013 2013 2013
## 5 month 1 2 4 7 10 11
## 6 hour 6 7 9 13 17 19
## 7 day 1 4 8 16 23 28
## 8 distance 173 214 509 888 1389 2446
## 9 air_time 33 47 82 129 192 319
## 10 sched_dep_time 600 705 905 1355 1729 1944
## 11 sched_arr_time 38 916 1122 1554 1944 2200
## 12 dep_time 551 703 907 1400 1744 2008
## 13 arr_time 22 853 1104 1535 1940 2158
## 14 flight 11 207 544 1467 3412 4438
## 15 ID 3274.45 32735.5 81837.25 163673.5 245509.75 294611.5
## p_99
## 1 59
## 2 191
## 3 190
## 4 2013
## 5 12
## 6 22
## 7 31
## 8 2586
## 9 364
## 10 2225
## 11 2353
## 12 2251
## 13 2344
## 14 5736
## 15 324072.55
glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hour <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, …
## $ temp <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.…
## $ dewp <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.…
## $ humid <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.…
## $ wind_dir <int> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,…
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, …
## $ wind_gust <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.…
## $ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pressure <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101…
## $ visib <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ time_hour <chr> "2013-01-01T06:00:00Z", "2013-01-01T07:00:00Z", "2013-01-01…
summary(weather)
## origin year month day
## Length:26115 Min. :2013 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :2013 Median : 7.000 Median :16.00
## Mean :2013 Mean : 6.504 Mean :15.68
## 3rd Qu.:2013 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :2013 Max. :12.000 Max. :31.00
##
## hour temp dewp humid
## Min. : 0.00 Min. : 10.94 Min. :-9.94 Min. : 12.74
## 1st Qu.: 6.00 1st Qu.: 39.92 1st Qu.:26.06 1st Qu.: 47.05
## Median :11.00 Median : 55.40 Median :42.08 Median : 61.79
## Mean :11.49 Mean : 55.26 Mean :41.44 Mean : 62.53
## 3rd Qu.:17.00 3rd Qu.: 69.98 3rd Qu.:57.92 3rd Qu.: 78.79
## Max. :23.00 Max. :100.04 Max. :78.08 Max. :100.00
## NA's :1 NA's :1 NA's :1
## wind_dir wind_speed wind_gust precip
## Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.000000
## 1st Qu.:120.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.000000
## Median :220.0 Median : 10.357 Median :24.17 Median :0.000000
## Mean :199.8 Mean : 10.518 Mean :25.49 Mean :0.004469
## 3rd Qu.:290.0 3rd Qu.: 13.809 3rd Qu.:28.77 3rd Qu.:0.000000
## Max. :360.0 Max. :1048.361 Max. :66.75 Max. :1.210000
## NA's :460 NA's :4 NA's :20778
## pressure visib time_hour
## Min. : 983.8 Min. : 0.000 Length:26115
## 1st Qu.:1012.9 1st Qu.:10.000 Class :character
## Median :1017.6 Median :10.000 Mode :character
## Mean :1017.9 Mean : 9.255
## 3rd Qu.:1023.0 3rd Qu.:10.000
## Max. :1042.1 Max. :10.000
## NA's :2729
anomalies(weather)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 precip 26115 0 - 24366 93.3% 0 - 0 - 59
## 2 wind_gust 26115 20778 79.56% 0 - 0 - 0 - 38
## 3 pressure 26115 2729 10.45% 0 - 0 - 0 - 469
## 4 wind_dir 26115 460 1.76% 1256 4.81% 0 - 0 - 38
## 5 wind_speed 26115 4 0.02% 1256 4.81% 0 - 0 - 37
## 6 hour 26115 0 - 1075 4.12% 0 - 0 - 24
## 7 visib 26115 0 - 10 0.04% 0 - 0 - 20
## 8 dewp 26115 1 0% 0 - 0 - 0 - 154
## 9 temp 26115 1 0% 0 - 0 - 0 - 174
## 10 humid 26115 1 0% 0 - 0 - 0 - 2500
## 11 year 26115 0 - 0 - 0 - 0 - 1
## 12 origin 26115 0 - 0 - 0 - 0 - 3
## 13 month 26115 0 - 0 - 0 - 0 - 12
## 14 day 26115 0 - 0 - 0 - 0 - 31
## 15 time_hour 26115 0 - 0 - 0 - 0 - 8714
## type anomalous_percent
## 1 Numeric 93.3%
## 2 Numeric 79.56%
## 3 Numeric 10.45%
## 4 Integer 6.57%
## 5 Numeric 4.82%
## 6 Integer 4.12%
## 7 Numeric 0.04%
## 8 Numeric 0%
## 9 Numeric 0%
## 10 Numeric 0%
## 11 Integer -
## 12 Character -
## 13 Integer -
## 14 Integer -
## 15 Character -
##
## $problem_variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct type
## 1 precip 26115 0 - 24366 93.3% 0 - 0 - 59 Numeric
## 2 year 26115 0 - 0 - 0 - 0 - 1 Integer
## anomalous_percent problems
## 1 93.3% Anomalies present in 93.3% of the rows.
## 2 - Less than 2 distinct values.
distributions(weather)
## ================================================================================
## Variable p_1 p_10 p_25 p_50 p_75 p_90 p_99
## 1 precip 0 0 0 0 0 0 0.13
## 2 wind_gust 16.1109 18.4125 20.714 24.1664 28.7695 33.3726 43.7296
## 3 pressure 1001.3 1008.5 1012.9 1017.6 1023 1027.5 1036.315
## 4 wind_dir 0 30 120 220 290 330 360
## 5 wind_speed 0 4.6031 6.9047 10.357 13.8094 18.4125 26.4679
## 6 hour 0 2 6 11 17 21 23
## 7 visib 0.5 7 10 10 10 10 10
## 8 dewp 1.04 15.08 26.06 42.08 57.92 66.92 73.04
## 9 temp 19.94 32 39.92 55.4 69.98 78.8 91.04
## 10 humid 23.39 37.46 47.05 61.79 78.79 89.57 100
## 11 year 2013 2013 2013 2013 2013 2013 2013
## 12 month 1 2 4 7 9 11 12
## 13 day 1 4 8 16 23 28 31
glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O…
## $ name <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline…
summary(airlines)
## carrier name
## Length:16 Length:16
## Class :character Class :character
## Mode :character Mode :character
anomalies(airlines)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct type
## 1 carrier 16 0 - 0 - 0 - 0 - 16 Character
## 2 name 16 0 - 0 - 0 - 0 - 16 Character
## anomalous_percent
## 1 -
## 2 -
##
## $problem_variables
## [1] Variable q qNA pNA
## [5] qZero pZero qBlank pBlank
## [9] qInf pInf qDistinct type
## [13] anomalous_percent problems
## <0 rows> (or 0-length row.names)
glimpse(planes)
## Rows: 3,322
## Columns: 9
## $ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
## $ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
## $ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
## $ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
## $ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
## $ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
summary(planes)
## tailnum year type manufacturer
## Length:3322 Min. :1956 Length:3322 Length:3322
## Class :character 1st Qu.:1997 Class :character Class :character
## Mode :character Median :2001 Mode :character Mode :character
## Mean :2000
## 3rd Qu.:2005
## Max. :2013
## NA's :70
## model engines seats speed
## Length:3322 Min. :1.000 Min. : 2.0 Min. : 90.0
## Class :character 1st Qu.:2.000 1st Qu.:140.0 1st Qu.:107.5
## Mode :character Median :2.000 Median :149.0 Median :162.0
## Mean :1.995 Mean :154.3 Mean :236.8
## 3rd Qu.:2.000 3rd Qu.:182.0 3rd Qu.:432.0
## Max. :4.000 Max. :450.0 Max. :432.0
## NA's :3299
## engine
## Length:3322
## Class :character
## Mode :character
##
##
##
##
anomalies(planes)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 speed 3322 3299 99.31% 0 - 0 - 0 - 14
## 2 year 3322 70 2.11% 0 - 0 - 0 - 47
## 3 type 3322 0 - 0 - 0 - 0 - 3
## 4 engines 3322 0 - 0 - 0 - 0 - 4
## 5 engine 3322 0 - 0 - 0 - 0 - 6
## 6 manufacturer 3322 0 - 0 - 0 - 0 - 35
## 7 seats 3322 0 - 0 - 0 - 0 - 48
## 8 model 3322 0 - 0 - 0 - 0 - 127
## 9 tailnum 3322 0 - 0 - 0 - 0 - 3322
## type anomalous_percent
## 1 Integer 99.31%
## 2 Integer 2.11%
## 3 Character -
## 4 Integer -
## 5 Character -
## 6 Character -
## 7 Integer -
## 8 Character -
## 9 Character -
##
## $problem_variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 speed 3322 3299 99.31% 0 - 0 - 0 - 14
## type anomalous_percent problems
## 1 Integer 99.31% Anomalies present in 99.31% of the rows.
distributions(planes)
## ================================================================================
## Variable p_1 p_10 p_25 p_50 p_75 p_90 p_99
## 1 speed 90 97 107.5 162 432 432 432
## 2 year 1984 1990 1997 2001 2005 2009 2013
## 3 engines 2 2 2 2 2 2 2
## 4 seats 9.21 55 140 149 182 200 379
glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "…
## $ name <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur…
## $ lat <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4…
## $ lon <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342…
## $ alt <int> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10…
## $ tz <int> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, …
## $ dst <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",…
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Ameri…
summary(airports)
## faa name lat lon
## Length:1458 Length:1458 Min. :19.72 Min. :-176.65
## Class :character Class :character 1st Qu.:34.26 1st Qu.:-119.19
## Mode :character Mode :character Median :40.09 Median : -94.66
## Mean :41.65 Mean :-103.39
## 3rd Qu.:45.07 3rd Qu.: -82.52
## Max. :72.27 Max. : 174.11
## alt tz dst tzone
## Min. : -54.00 Min. :-10.000 Length:1458 Length:1458
## 1st Qu.: 70.25 1st Qu.: -8.000 Class :character Class :character
## Median : 473.00 Median : -6.000 Mode :character Mode :character
## Mean :1001.42 Mean : -6.519
## 3rd Qu.:1062.50 3rd Qu.: -5.000
## Max. :9078.00 Max. : 8.000
anomalies(airports)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 alt 1458 0 - 51 3.5% 0 - 0 - 911
## 2 tzone 1458 3 0.21% 0 - 0 - 0 - 10
## 3 dst 1458 0 - 0 - 0 - 0 - 3
## 4 tz 1458 0 - 0 - 0 - 0 - 7
## 5 name 1458 0 - 0 - 0 - 0 - 1440
## 6 lat 1458 0 - 0 - 0 - 0 - 1456
## 7 faa 1458 0 - 0 - 0 - 0 - 1458
## 8 lon 1458 0 - 0 - 0 - 0 - 1458
## type anomalous_percent
## 1 Integer 3.5%
## 2 Character 0.21%
## 3 Character -
## 4 Integer -
## 5 Character -
## 6 Numeric -
## 7 Character -
## 8 Numeric -
##
## $problem_variables
## [1] Variable q qNA pNA
## [5] qZero pZero qBlank pBlank
## [9] qInf pInf qDistinct type
## [13] anomalous_percent problems
## <0 rows> (or 0-length row.names)
distributions(airports)
## ================================================================================
## Variable p_1 p_10 p_25 p_50 p_75 p_90 p_99
## 1 alt 0 15 70.25 473 1062.5 2906 6841.09
## 2 tz -10 -9 -8 -6 -5 -5 -5
## 3 lat 21.5382 30.4803 34.2575 40.0877 45.0671 59.9414 67.6392
## 4 lon -166.3004 -154.8695 -119.1857 -94.6619 -82.5167 -76.0951 -69.9471
Because the analysis focuses on departure delay, we applied the distributions() function to the flights data set. The percentiles show that the distribution of departure delays is strongly right-skewed: the median is -2 minutes, while the 99th percentile is 191 minutes. Since the objective of this work requires a thorough analysis of departure delay, we keep the skewness rather than transform the variable. The primary focus is on flights with a positive departure delay (flights that departed late).
One point to note is that the flights data set, which contains the departure delay times, is fairly clean, with no null values found. The same cannot be said for the other data sets, which contain varying numbers of null values (for example, wind_gust is missing in 79.56% of the weather rows and speed in 99.31% of the planes rows). Null values are handled on a case-by-case basis.
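As a rough sketch of what case-by-case handling can look like, the example below counts missing values per column and then ignores them for one summary. The data frame `weather_toy` is a made-up stand-in for weather.csv, not real data.

```r
# Small stand-in for weather.csv with a few missing values (made-up numbers)
weather_toy <- data.frame(
  temp      = c(39.0, NA, 41.0, 37.9),
  wind_gust = c(NA, NA, 20.7, NA),
  pressure  = c(1012.0, 1012.3, NA, 1011.9)
)

# Count and percentage of missing values per column
na_count <- colSums(is.na(weather_toy))
na_share <- round(100 * colMeans(is.na(weather_toy)), 1)
print(data.frame(na_count, na_share))

# One case-by-case option: ignore NAs when summarising a single column
mean_temp <- mean(weather_toy$temp, na.rm = TRUE)
print(mean_temp)
```

A column that is mostly missing, such as wind_gust here, may be better dropped from an analysis than imputed. That choice depends on the question being asked.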
With the above in mind, we now move on to the in-depth analysis of the exploratory questions.
Question 1 : Is there a pattern to the departure delay in terms of time? (Month, Day of week and Hour)
Arrival Delay vs Departure Delay
Departure delay = Actual departure time − Scheduled departure time
Arrival delay = Actual arrival time − Scheduled arrival time.
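Because the times are stored as HHMM integers, subtracting them directly gives wrong answers across an hour boundary (600 - 554 is 46, although the real gap is 6 minutes). The sketch below converts HHMM values to minutes after midnight before subtracting. The helper `hhmm_to_minutes` is hypothetical, and the sketch assumes both times fall on the same calendar day. The input values are the first five rows of the glimpse(flights) output.

```r
# Convert an HHMM (or HMM) clock time such as 517 or 1004 to minutes after midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# First five rows of dep_time and sched_dep_time from glimpse(flights)
dep_time       <- c(517, 533, 542, 544, 554)
sched_dep_time <- c(515, 529, 540, 545, 600)

# Assumption: both times are on the same calendar day (no midnight rollover)
dep_delay <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
print(dep_delay)
```

The result, 2, 4, 2, -1, -6, matches the first five dep_delay values in the data set. Flights delayed past midnight would need the date as well as the clock time.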
We see that a positive relationship exists between dep_delay and arr_delay: as departure delays increase, arrival delays also tend to increase. In general, the later a plane departs, the later it will typically arrive.
In the graph below, there is a cluster of points near (0, 0). The point (0, 0) means no delay in departure or arrival; from the passenger’s point of view, the flight was on time. Most flights appear to be at least close to on time at all the origin airports [EWR, JFK, LGA].
We can also observe large positive values of dep_delay, which may be due to many factors such as adverse weather conditions. In such conditions, airports may operate under tighter restrictions on take-offs and landings, which can increase departure or arrival delays.
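The strength of the relationship can also be summarised numerically. The sketch below uses made-up delays to show the calculation. Applied to the report's data, the inputs would be flights$dep_delay and flights$arr_delay, and the results would differ.

```r
# Made-up delays in minutes, for illustration only
dep_delay <- c(-5, -2, 0, 4, 10, 25, 60, 120)
arr_delay <- c(-18, -10, -6, 1, 5, 20, 52, 115)

# Pearson correlation: values close to 1 indicate a strong positive linear relationship
r <- cor(dep_delay, arr_delay)

# Slope of a simple linear fit: change in arrival delay per extra minute of departure delay
fit <- lm(arr_delay ~ dep_delay)
slope <- unname(coef(fit)["dep_delay"])

print(c(correlation = r, slope = slope))
```

A slope near 1 would suggest that, on average, little of a departure delay is recovered in the air.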
flights %>%
  ggplot() + aes(x = dep_delay, y = arr_delay, color = origin) +
  geom_point(alpha = 0.2) +
  labs(x = "Departure Delay (in minutes)", y = "Arrival Delay (in minutes)", title = "Arrival Delay vs Departure Delay") +
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x = element_text(size = 8),
        strip.background = element_blank(),
        panel.background = element_blank())
In our study, we are looking for patterns in the delays experienced by flights departing from New York City.
Setting up a global data frame for delayed flights
flights_seasonal <- flights %>%
  # keep late departures below the 99th percentile of dep_delay
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  mutate(season = ifelse(month %in% 9:11, "Fall",
                  ifelse(month %in% 6:8, "Summer",
                  ifelse(month %in% 3:5, "Spring", "Winter")))) %>%
  mutate(month = factor(month, levels = 1:12, labels = c("Jan","Feb","Mar","Apr","May",
                        "Jun","Jul","Aug","Sep","Oct","Nov","Dec"), ordered = TRUE)) %>%
  # build the date once and derive the weekday name from it
  mutate(date = ymd(paste(year, month, day)), dayofweek = weekdays(date)) %>%
  mutate(day_of_week = factor(dayofweek, levels = c("Sunday","Monday","Tuesday","Wednesday",
                                                    "Thursday","Friday","Saturday"),
                              labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"),
                              ordered = TRUE))
head(flights_seasonal)
## ID year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 1 2013 Jan 1 517 515 2 830 819
## 2 2 2013 Jan 1 533 529 4 850 830
## 3 3 2013 Jan 1 542 540 2 923 850
## 4 20 2013 Jan 1 601 600 1 844 850
## 5 26 2013 Jan 1 608 600 8 807 735
## 6 27 2013 Jan 1 611 600 11 945 931
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -6 B6 343 N644JB EWR PBI 147 1023 6 0
## 5 32 MQ 3768 N9EAMQ EWR ORD 139 719 6 0
## 6 14 UA 303 N532UA JFK SFO 366 2586 6 0
## time_hour season date dayofweek day_of_week
## 1 2013-01-01T10:00:00Z Winter 2013-01-01 Tuesday Tue
## 2 2013-01-01T10:00:00Z Winter 2013-01-01 Tuesday Tue
## 3 2013-01-01T10:00:00Z Winter 2013-01-01 Tuesday Tue
## 4 2013-01-01T11:00:00Z Winter 2013-01-01 Tuesday Tue
## 5 2013-01-01T11:00:00Z Winter 2013-01-01 Tuesday Tue
## 6 2013-01-01T11:00:00Z Winter 2013-01-01 Tuesday Tue
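One caveat about the weekday step: weekdays() returns day names in the language of the R session's locale, so matching them against English factor levels can produce NA values on a non-English system. The sketch below shows a locale-independent alternative in base R. The label vector and variable names are assumptions made for illustration.

```r
# A few 2013 dates; 2013-01-01 was a Tuesday, as in the output above
dates <- as.Date(c("2013-01-01", "2013-01-05", "2013-01-06"))

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday) in any locale
iso_day <- as.integer(format(dates, "%u"))

# Map the numbers to fixed English labels, ordered Sunday-first as in the report
labels_mon_first <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
day_of_week <- factor(labels_mon_first[iso_day],
                      levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                      ordered = TRUE)
print(day_of_week)
```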
flights_delayed_flight <- flights_seasonal %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
select(carrier,hour, origin,month, season, dep_delay,time_hour)
head(flights_delayed_flight)
## carrier hour origin month season dep_delay time_hour
## 1 UA 5 EWR Jan Winter 2 2013-01-01T10:00:00Z
## 2 UA 5 LGA Jan Winter 4 2013-01-01T10:00:00Z
## 3 AA 5 JFK Jan Winter 2 2013-01-01T10:00:00Z
## 4 B6 6 EWR Jan Winter 1 2013-01-01T11:00:00Z
## 5 MQ 6 EWR Jan Winter 8 2013-01-01T11:00:00Z
## 6 UA 6 JFK Jan Winter 11 2013-01-01T11:00:00Z
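A point worth checking: several chunks in this section apply dep_delay < quantile(dep_delay, 0.99) to flights_seasonal, which has already been trimmed once. Within a single filter() call the quantile is computed on whatever data is passed in, so repeating the filter recomputes a lower cutoff and drops roughly another 1% of rows. The sketch below reproduces the effect in base R on made-up delays. The helper `trim_once` is hypothetical.

```r
# Made-up delays: 1..200 minutes, one value each
dep_delay <- 1:200

# Keep positive values below the 99th percentile of the input
trim_once <- function(x) x[x > 0 & x < quantile(x, 0.99, names = FALSE)]

once  <- trim_once(dep_delay)  # cutoff computed on the full data
twice <- trim_once(once)       # cutoff recomputed on the already-trimmed data

print(c(original = length(dep_delay), once = length(once), twice = length(twice)))
```

If a single fixed cutoff is intended, one option is to compute it once, store it in a variable, and reuse that value in later filters.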
The code below computes the mean departure delay of the delayed flights (positive delays below the 99th percentile):
paste(flights %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  summarise(mean(dep_delay)))
## [1] "33.4432946620818"
1. Trend of Average Departure Delay by Hour
flights_seasonal %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(hour) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() + aes(x = as.numeric(hour), y = avg_dep_delay, color = hour) +
  geom_point() +
  scale_x_continuous(breaks = c(5, 11, 17, 22), labels = function(x){case_when(x == 5 ~ '5am', x == 11 ~ '11am', x == 17 ~ '5pm', x == 22 ~ '10pm')}) +
  geom_smooth(position = "identity") +
  labs(x = "Hours", y = "Average Departure Delay (in minutes)", title = "Average delay for each hour") +
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x = element_text(size = 8),
        strip.background = element_blank(),
        panel.background = element_blank())
The above graph shows that you are more likely to be delayed if you fly later in the day rather than in the morning.
According to the data, the best time to fly to avoid delays is between 5 a.m. and 8 a.m., when the average departure delay is approximately 20 minutes.
Minimizing departure delays early in the day also benefits subsequent flights, by reducing the propagation of delay between consecutive flights.
The trend is clearly increasing, with the hourly average departure delay exceeding the overall mean departure delay (33.44 minutes) from about 3 pm until midnight.
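The comparison of hourly averages with the overall mean can be sketched in base R as follows. The data frame `toy` contains made-up values, so the hours flagged here are illustrative only.

```r
# Made-up delayed flights: scheduled hour and departure delay in minutes
toy <- data.frame(
  hour      = c(6, 6, 7, 12, 12, 16, 16, 20, 20, 21),
  dep_delay = c(10, 14, 20, 25, 35, 40, 50, 55, 65, 70)
)

# Overall mean delay across all toy flights
overall_mean <- mean(toy$dep_delay)

# Average delay per hour, then flag the hours above the overall mean
by_hour <- aggregate(dep_delay ~ hour, data = toy, FUN = mean)
by_hour$above_overall <- by_hour$dep_delay > overall_mean
print(by_hour)
```

The same flag is what the fill aesthetic avg_dep_delay >= 33.44 encodes in the faceted bar charts that follow.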
flights_seasonal %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(hour,carrier) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x = hour, y = avg_dep_delay,fill=avg_dep_delay>=33.44) +
geom_col(alpha=0.7)+labs(x="Hours",y="Average Departure Delay (in minutes)",
title="Average delay for each hour facet by carrier ") +
geom_hline(aes(yintercept=33.44),linetype = 2)+
facet_wrap(carrier~.,ncol = 4)+theme_bw()
According to our observations, most of the carriers show delay propagation: the average delay increases as we approach the end of the day.
flights_seasonal %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(hour,origin) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x = hour, y = avg_dep_delay,fill=avg_dep_delay>=33.44) +
geom_col(alpha=0.7)+labs(x="Hours",y="Average Departure Delay (in minutes)",
title="Average delay for each hour facet by origin") +
geom_hline(aes(yintercept=33.44),linetype = 2)+
facet_wrap(origin~.)+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
According to our observations, the origin airports also show delay propagation: a departure delay at one stage of the day causes a ripple effect in subsequent stages, so the average departure delay increases as the day goes on.
2. Seasonal Trend of Average Departure Delay by Hour in each season: Fall, Summer, Spring, Winter
The following graph shows the average departure delay for each month, colored by season: Fall, Summer, Spring, Winter.
flights_seasonal %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(month, season) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = avg_dep_delay, group = season, fill = season)) +
  geom_col() +
  labs(x = "Months", y = "Average Departure Delay (in minutes)", title = "Average Departure Delay vs Seasons") +
  geom_hline(aes(yintercept = 33.443), linetype = 2) +
  geom_text(aes(10, 33.443 + 2, label = "Avg Dep_delay(min)"), size = 3, color = "black") +
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x = element_text(size = 8),
        strip.background = element_blank(),
        panel.background = element_blank())
New York’s climate is classed as continental, which means that it has four distinct seasons: spring (March-May), summer (June-August), autumn (September-November) and winter (December-February).
From the graph, we can conclude that the monthly average departure delay exceeded the overall mean departure delay during the summer peak season (June and July). The high volume of tourism during the summer might be the cause of the high average departure delay.
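The month-to-season mapping described above can also be written as a lookup vector instead of nested ifelse() calls. This is only an alternative sketch that follows the same month ranges; the name `season_of_month` is hypothetical.

```r
# Lookup vector: position i holds the season of month i (1 = January)
season_of_month <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Fall", "Fall", "Fall",
                     "Winter")

month  <- c(1, 4, 7, 10, 12)        # example month numbers
season <- season_of_month[month]    # index the lookup by month number
print(season)
```

A lookup of this kind keeps the season definition in one place, which makes it easier to change if a different convention is wanted.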
flights_seasonal %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(hour, season) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x =hour, y = avg_dep_delay, group=season, color=season)) +
geom_line(lwd = 2) +
labs(x="Hour", y="Average Departure delay (in minutes)", title = "Seasonal Trend of Average Departure Delay by Hour") +
geom_hline(aes(yintercept=33.443),linetype = 2) +
geom_text(aes(10, 33.443+2, label="Avg Dep_delay(min)"), size = 3, color="black") +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
The above graph shows the seasonal trend of average departure delay by hour for each season: Fall, Summer, Spring, Winter. The summer and spring curves change rapidly from hour to hour.
flights_delayed_flight %>%
ggplot(aes(hour,color=season))+
geom_freqpoly(binwidth = 1,lwd=2) +
labs(x="Hour", y="Number of Delayed flights", title = "Seasonal Trend of Number of Delayed Flights by Hour") +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
A comparison of the total number of delayed flights per hour by season shows that, although every season follows a similar trend, summer and spring have the highest number of delayed flights.
flights %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
mutate(date = ymd(paste(year, month, day))) %>%
group_by(date) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x = date, y = avg_dep_delay, color=date) +
geom_point() +
geom_smooth(position = "identity") +
labs(x="Dates", y="Average Departure delay (in minutes)", title = "Average Departure Delay vs Dates") +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
The total number of delayed flights per month increases during the spring and summer, i.e. from April to July, and during the winter, specifically in December.
3.Seasonal Trend of Average Departure Delay vs Number of Delayed flight
flights %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(month,origin) %>%
summarise(count=n())%>%
ggplot(aes(x=month,y=count))+
geom_line(color="#00AFBB",lwd=2)+geom_point(size=2)+
scale_x_discrete(limits=1:12)+
labs(x="Month",y="Number of Departure Delays",title="Number of Departure Delays vs Month")+
facet_wrap(origin~.)+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
panel.background = element_blank()
)
The curves for the three origins exhibit a similar trend. Although they are very similar, the number of delayed flights at LGA is lower than at the other airports.
flights_seasonal %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(month)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(x=month,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +
geom_bar(stat = "identity") +
geom_hline(aes(yintercept=33.44),linetype = 2) +
labs(x="Month",y="Average Departure Delay (in minutes)",title="Average delay for each month") +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Among the months, September has the lowest number of delayed flights (around 8000), and June and July have the highest. As stated earlier, June and July also experience the highest average departure delays. In December, flight delays rise again along with the number of delayed flights.
4.Average Departure Delay vs Day of week
flights_seasonal %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(day_of_week)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(x=day_of_week,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Day of the week",y="Average Departure Delay (in minutes)",title="Average delay for each day of the week ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
From the above plot, Saturday has the lowest average departure delay and the lowest number of delayed flights (approximately 14000). Thursday and Friday have the highest number of delayed flights, approximately 20000.
Another point to note is that Monday and Thursday have the highest average departure delays.
Question 2 : How does weather impact flights from NYC? What is the effect of weather on departure delay?
Setting up global variable for weather
data_fw <- flights %>%
inner_join(weather, by = c("origin", "time_hour","month","hour"))%>%
mutate(count_delayed = ifelse(dep_delay > 0, 1, 0))%>%
mutate(season = ifelse(month %in% 9:11, "Fall",
ifelse(month %in% 6:8, "Summer",
ifelse(month %in% 3:5, "Spring","Winter"))))
glimpse(data_fw)
## Rows: 325,819
## Columns: 33
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ year.x <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day.x <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <int> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <int> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-0…
## $ year.y <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ day.y <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ temp <dbl> 39.02, 39.92, 39.02, 39.02, 39.92, 39.02, 37.94, 39.92,…
## $ dewp <dbl> 28.04, 24.98, 26.96, 26.96, 24.98, 28.04, 28.04, 24.98,…
## $ humid <dbl> 64.43, 54.81, 61.63, 61.63, 54.81, 64.43, 67.21, 54.81,…
## $ wind_dir <int> 260, 250, 260, 260, 260, 260, 240, 260, 260, 260, 260, …
## $ wind_speed <dbl> 12.65858, 14.96014, 14.96014, 14.96014, 16.11092, 12.65…
## $ wind_gust <dbl> NA, 21.86482, NA, NA, 23.01560, NA, NA, 23.01560, NA, 2…
## $ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pressure <dbl> 1011.9, 1011.4, 1012.1, 1012.1, 1011.7, 1011.9, 1012.4,…
## $ visib <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ count_delayed <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ season <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Wint…
We use a correlation matrix to understand which variables are most correlated with dep_delay.
Variables are:
1.temp
2.dewp
3.humid
4.precip
5.pressure
6.visibility
According to the correlation plot, here are a few inferences:
1.High relative humidity results in low visibility.
2.High relative humidity results in precipitation. The higher the humidity, the more water vapor in the air and the more rain we are likely to see.
3.Since dewp and temp are highly correlated, we will only investigate one of them.
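The way such inferences are read off a correlation matrix can be sketched on a small toy data set. The values below are illustrative only and are not taken from the NYC Flights data, and the 0.9 cut-off for "highly correlated" is an assumption made for this sketch.

```r
# Toy data (illustrative values only, not from the NYC Flights data)
toy <- data.frame(
  temp  = c(30, 35, 40, 50, 60, 70, 80, NA),
  dewp  = c(20, 26, 29, 41, 50, 61, 69, 55),
  humid = c(66, 70, 65, 71, 70, 73, 69, 60)
)

# Pairwise correlations, dropping rows that contain a missing value
cor_mat <- cor(toy, use = "complete.obs")
round(cor_mat, 2)

# Assumed cut-off: treat |r| > 0.9 as "highly correlated"
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, "row"]],
           var2 = colnames(cor_mat)[high[, "col"]])
```

Listing the pairs above the cut-off makes it explicit which variable can be dropped when two of them carry nearly the same information.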
cor_data <- select(data_fw, dep_delay, temp, dewp, humid, precip, pressure, visib) %>%
na.omit() %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))
# We first create a correlation matrix using the 'cor' function and then plot it
# with corrplot to find the variables that are correlated.
corrplot(cor(cor_data), method = "square")
data_fw <- flights %>%
inner_join(weather, by = c("origin", "time_hour","month","hour"))%>%
mutate(count_delayed = ifelse(dep_delay > 0, 1, 0))%>%
mutate(season = ifelse(month %in% 9:11, "Fall",
ifelse(month %in% 6:8, "Summer",
ifelse(month %in% 3:5, "Spring","Winter"))))
head(data_fw)
## ID year.x month day.x dep_time sched_dep_time dep_delay arr_time
## 1 1 2013 1 1 517 515 2 830
## 2 2 2013 1 1 533 529 4 850
## 3 3 2013 1 1 542 540 2 923
## 4 4 2013 1 1 544 545 -1 1004
## 5 5 2013 1 1 554 600 -6 812
## 6 6 2013 1 1 554 558 -4 740
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance
## 1 819 11 UA 1545 N14228 EWR IAH 227 1400
## 2 830 20 UA 1714 N24211 LGA IAH 227 1416
## 3 850 33 AA 1141 N619AA JFK MIA 160 1089
## 4 1022 -18 B6 725 N804JB JFK BQN 183 1576
## 5 837 -25 DL 461 N668DN LGA ATL 116 762
## 6 728 12 UA 1696 N39463 EWR ORD 150 719
## hour minute time_hour year.y day.y temp dewp humid wind_dir
## 1 5 15 2013-01-01T10:00:00Z 2013 1 39.02 28.04 64.43 260
## 2 5 29 2013-01-01T10:00:00Z 2013 1 39.92 24.98 54.81 250
## 3 5 40 2013-01-01T10:00:00Z 2013 1 39.02 26.96 61.63 260
## 4 5 45 2013-01-01T10:00:00Z 2013 1 39.02 26.96 61.63 260
## 5 6 0 2013-01-01T11:00:00Z 2013 1 39.92 24.98 54.81 260
## 6 5 58 2013-01-01T10:00:00Z 2013 1 39.02 28.04 64.43 260
## wind_speed wind_gust precip pressure visib count_delayed season
## 1 12.65858 NA 0 1011.9 10 1 Winter
## 2 14.96014 21.86482 0 1011.4 10 1 Winter
## 3 14.96014 NA 0 1012.1 10 1 Winter
## 4 14.96014 NA 0 1012.1 10 0 Winter
## 5 16.11092 23.01560 0 1011.7 10 0 Winter
## 6 12.65858 NA 0 1011.9 10 0 Winter
In this section, we use two measures to understand the relationship between departure delay and the weather variables.
1.Delay Percent: The number of delayed flights divided by the total number of flights, for each value of the weather variable
2.Average Departure delay: The average departure delay of all delayed flights, for each value of the weather variable
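The two measures can be sketched on a toy data set as follows. The data values and the helper name `delay_measures` are hypothetical and serve only to make the definitions concrete; base R is used so the sketch runs on its own.

```r
# Toy data (illustrative values only): a weather reading and the departure
# delay in minutes for six flights
toy <- data.frame(
  precip    = c(0, 0, 0, 0.1, 0.1, 0.1),
  dep_delay = c(-5, 10, -2, 20, 40, -1)
)

# Hypothetical helper: compute both measures for each value of `weather_var`
delay_measures <- function(df, weather_var) {
  groups <- split(df$dep_delay, df[[weather_var]])
  data.frame(
    value         = as.numeric(names(groups)),
    # share of flights with a positive delay, in percent
    delay_percent = vapply(groups, function(d) 100 * mean(d > 0), numeric(1), USE.NAMES = FALSE),
    # mean delay over the delayed flights only
    avg_dep_delay = vapply(groups, function(d) mean(d[d > 0]), numeric(1), USE.NAMES = FALSE)
  )
}

delay_measures(toy, "precip")
```

Delay percent uses every flight in the group as its denominator, whereas the average departure delay is taken over the delayed flights only, so the two measures can move independently.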
1. Precipitation
data_fw %>%
filter(!is.na(precip)) %>%
group_by (month,season,origin) %>%
summarise(avg_precip = mean(precip, na.rm = TRUE)) %>%
ggplot()+ aes(x = factor(month), y =avg_precip,fill=season)+geom_col(position = "identity")+theme_bw()+
scale_x_discrete(limits=1:12)+
labs(x="Month",y="Average precipitation (in inches)",title="Average Precipitation vs Month")+
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())+facet_wrap(origin~.)
The graph shows high average precipitation during June and July at all three origins. The highest average precipitation is 1.2 inches, at EWR in June.
data_fw%>%
filter(!is.na(precip ))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(precip) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =precip,y=avg_dep_delay,color = avg_dep_delay) + geom_point()+
geom_smooth() +
labs(x="Precipitation (in inches)", y="Average Departure delay (in minutes)", title = "Precipitation vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
As the graph above shows, the average departure delay increases with precipitation once precipitation exceeds about 0.2 inches.
data_fw %>%
filter(precip > 0, precip < quantile(precip, 0.99)) %>%
group_by(precip,season)%>%
summarise(count_delay = sum(count_delayed),
count = n())%>%
ggplot(aes(x=precip,y=(100*(count_delay/count)))) +geom_line(stat = "identity",lwd=2,color="#00AFBB") +labs(x="Precipitation (in inches) ",y="Delay Percent (%)",title="Delay Percent (%) vs Precipitation (in inches) ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank()
)+facet_wrap(season~.)
The delay percent is also higher during the summer, ranging from 60% to 80%. There is a decreasing trend in winter and an increasing trend in fall.
2. Humidity
Relative humidity is usually high at midnight and in the early morning. It drops rapidly after the sun rises, reaches its lowest point just after midday, and then increases again until midnight. As with precipitation, relative humidity is correlated with average delay, since, as we saw, the higher the relative humidity, the greater the chance of rain.
data_fw %>%
filter(!is.na(humid)) %>%
group_by(hour,season) %>%
summarise(avg_humid = mean(humid, na.rm = TRUE)) %>%
ggplot()+ aes(x =hour, y = avg_humid, color=season)+geom_line(position = "identity",lwd=2)+theme_bw()+
labs(x="Hours",y="Relative Humidity (%) ",title="Average Humidity vs Hours")+
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
data_fw%>%
filter(!is.na(humid))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(humid) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =humid,y=avg_dep_delay,color = avg_dep_delay) +
geom_smooth() +
labs(x="Relative humidity (%)", y="Average Departure delay (in minutes)", title = "Relative humidity vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
As the graph above shows, the average departure delay increases as the relative humidity increases.
data_fw %>%
filter(!is.na(humid))%>%
filter(humid > 0, humid < quantile(humid, 0.99)) %>%
group_by(humid,hour)%>%
summarise(count_delay = sum(count_delayed),
count = n())%>%
ggplot(aes(x=humid,y=(100*(count_delay/count)))) +geom_boxplot(fill="#00AFBB") +labs(x="Relative humidity ",y="Delay Percent (%)",title="Delay Percent (%) vs Relative humidity ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank()
)+facet_wrap(hour~.)+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
We see a higher delay percentage at 5 a.m. and from 9 p.m. to 11 p.m. (hours 21 to 23), due to humidity variations within a day.
3. Visibility
Visibility is estimated from the intensity of scattered light, which decreases when there are more fog droplets, smoke or haze particles, raindrops or snowflakes in the beam.
The graph below shows that visibility is lowest in winter, as a result of fog, smoke, haze, rain or snow.
data_fw %>%
filter(!is.na(visib)) %>%
group_by(hour,season) %>%
summarise(avg_visib = mean(visib, na.rm = TRUE)) %>%
ggplot()+ aes(x =hour, y = avg_visib, color=season)+geom_line(position = "identity",lwd=2)+theme_bw()+
labs(x="Hours",y="Average Visibility (in miles) ",title="Average Visibility vs Hours in each seasons")+
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
data_fw%>%
filter(!is.na(visib))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(visib) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =visib,y=avg_dep_delay,color = avg_dep_delay) +geom_point()+
geom_smooth() +
labs(x="Visibility (in miles)", y="Average Departure delay (in minutes)", title = "Visibility (in miles) vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Visibility is one of the main drivers of departure delay. Better visibility allows a smaller separation distance in the take-off sequence or landing queue, which helps reduce departure delays. Here the separation distance is the distance between the current aircraft and the preceding aircraft on the same runway.
Low visibility increases take-off and landing separations, which reduces the airport’s capacity and is then likely to result in departure or arrival delays.
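The link between separation distance and capacity can be made concrete with a back-of-the-envelope calculation. The speed and separation values below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the report or from any airport's operating procedures.

```r
# Hypothetical numbers for illustration only
approach_speed_knots <- 140   # assumed ground speed, in nautical miles per hour
separation_nm <- c(good_visibility = 3, low_visibility = 6)  # assumed spacing

# Aircraft handled per hour on one runway = distance covered per hour / spacing
runway_capacity <- approach_speed_knots / separation_nm
runway_capacity
```

Under these assumptions, doubling the separation halves the number of aircraft a runway can handle per hour, which is the mechanism by which low visibility turns into delays.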
data_fw %>%
filter(visib > 0, visib < quantile(visib, 0.99)) %>%
group_by(visib,season)%>%
summarise(count_delay = sum(count_delayed),
count = n())%>%
ggplot(aes(x=visib,y=(100*(count_delay/count)))) +geom_violin(trim=FALSE,color="#00AFBB") +labs(x="Visibility (in miles) ",y="Delay Percent (%)",title="Delay Percent (%) vs Visibility (in miles) ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank()
)+facet_wrap(season~.)
According to the violin plot, the delay percent is widely distributed during the winter months. Wider sections of a violin indicate that more observations fall at that delay percent, which implies a significant proportion of delayed flights.
4. Pressure
When the air pressure is high, the air molecules become more tightly packed and denser. Aircraft performance depends on this pressure. The propeller is more effective when it is pushing more air molecules to produce thrust. The wing generates more lift when it is pushing more air molecules downwards. Hence this may result in low average departure delay since the take-off is easier.
In the weather data set, the pressure parameter has null values. The data cleaning process here removes the null values with filter(!is.na(pressure)) and passes na.rm = TRUE to mean().
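The null handling described above can be sketched in base R on toy data. The readings are illustrative values only; the sketch shows how to measure the share of missing values and the two usual ways of excluding them.

```r
# Toy weather readings (illustrative values only)
toy_weather <- data.frame(
  hour     = c(6, 6, 7, 7, 8),
  pressure = c(1012.1, NA, 1011.4, 1013.0, NA)
)

# Share of missing values in the column, in percent
pct_missing <- 100 * mean(is.na(toy_weather$pressure))

# Option 1: drop the rows with a missing reading before summarising
complete <- toy_weather[!is.na(toy_weather$pressure), ]
mean(complete$pressure)

# Option 2: keep the rows and let the summary function skip the NAs
mean(toy_weather$pressure, na.rm = TRUE)
```

Both options give the same mean; dropping the rows first also removes them from any counts computed later, which matters for measures such as delay percent.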
data_fw %>%
filter(!is.na(pressure)) %>%
group_by(hour,season) %>%
summarise(avg_pressure = mean(pressure, na.rm = TRUE)) %>%
ggplot()+ aes(x =hour, y = avg_pressure, color=season)+geom_line(position = "identity",lwd=2)+theme_bw()+
labs(x="Hours",y="Average Pressure (in millibars) ",title="Average Pressure vs Hours in each seasons")+
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Air pressure is lowest during the summer because temperatures are high and warm air is less dense than cold air. As the density of the air increases (high pressure), aircraft performance increases; conversely, as air density decreases (low pressure), aircraft performance decreases.
data_fw%>%
filter(!is.na(pressure))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(pressure) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =pressure,y=avg_dep_delay,color = avg_dep_delay) +
geom_smooth() +
labs(x="Pressure (in millibars)", y="Average Departure delay (in minutes)", title = "Pressure (in millibars) vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
The above graph shows that the average departure delay decreases as the pressure increases, starting from around 980 millibars.
data_fw %>%
filter(!is.na(pressure))%>%
group_by(pressure,season)%>%
summarise(count_delay = sum(count_delayed),
count = n())%>%
ggplot(aes(x=pressure,y=(100*(count_delay/count)))) +
geom_boxplot(color="#00AFBB") +
labs(x="Pressure (in millibars)",y="Delay Percent (%)",title="Delay Percent (%) vs Pressure (in millibars)") +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank()
)+facet_wrap(season~.)
The delay percent is also higher during the summer, ranging from 30% to 60%. Winter and fall show a lower distribution of delay percent.
5. Temperature
As far as temperature is concerned, we assume that both hot and cold temperatures present adverse weather conditions and affect flight delays. Hot temperatures adversely affect aircraft engine performance, whereas cold temperatures are often associated with foggy and snowy days, which may result in poor airport surface performance and, as a consequence, longer flight delays as well.
data_fw %>%
filter(!is.na(temp)) %>%
group_by(month,season) %>%
summarise(avg_temp = mean(temp, na.rm = TRUE)) %>%
ggplot()+ aes(x = factor(month), y =avg_temp,fill=season)+geom_col(position = "identity")+theme_bw()+
scale_x_discrete(limits=1:12)+
labs(x="Month",y="Average Temperature (in F) ",title="Average Temperature (in F) vs Month")+
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
data_fw%>%
filter(!is.na(temp))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(temp) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =temp,y=avg_dep_delay,color = avg_dep_delay) +
geom_smooth() +
labs(x="Temperature (in F)", y="Average Departure delay (in minutes)", title = "Temperature (in F) vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
According to the graph, low temperatures exhibit a high average delay, and the delay also increases with temperature above 40 F. Thus a moderate temperature range is optimal for low average departure delays.
data_fw %>%
filter(!is.na(temp))%>%
group_by(temp,season)%>%
summarise(count_delay = sum(count_delayed),
count = n())%>%
ggplot(aes(x=temp,y=(100*(count_delay/count)))) +
geom_boxplot(color="#00AFBB") +
labs(x="Temperature (in F)",y="Delay Percent (%)",title="Delay Percent (%) vs Temperature (in F)") +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank()
)+facet_wrap(season~.)
As shown earlier, low temperatures are characterized by a high average delay, and the delay rises again as the temperature increases. Consistent with this, the delay percent is higher in the summer and the winter.
6. Wind Speed
The wind_speed variable gives the speed of the wind at the departure airport during the hour of the scheduled departure time of the flight.
High wind speed can affect an aircraft’s operational safety and lead to severe delays.
In the weather data set, the wind speed parameter also has null values. As with pressure, the data cleaning process removes them with filter(!is.na(wind_speed)); they account for only 0.02% of the values in that column.
data_fw%>%
filter(!is.na(wind_speed))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(wind_speed,origin) %>%
ggplot()+ aes(x =hour,y=wind_speed) +
geom_smooth()+labs(x=" Hour", y="Wind speed (in mph)", title = " Wind speed (in mph) vs Hour")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())+facet_wrap(origin~.)
The variation of wind speed with time of day is called the diurnal cycle. Near the earth’s surface, winds are usually greater during the middle of the day and decrease at night. This is due to solar heating, which causes “bubbles” of warm air to rise.
From the graph we can say that JFK is the windiest airport among the three origins.
data_fw%>%
filter(!is.na(wind_speed))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(wind_speed) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =wind_speed,y=avg_dep_delay,color = avg_dep_delay) +
geom_smooth() +
labs(x="Wind speed (in mph)", y="Average Departure delay (in minutes)", title = "Wind speed (in mph) vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
In fact, take-off and landing are the only times during a flight when high winds can result in flight delays. Strong winds blowing across the runway (known as “crosswinds”) of about 25-35 mph are generally prohibitive of take-off and landing.
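How much of a given wind acts as a crosswind depends on the angle between the wind direction and the runway. The sketch below uses the standard trigonometric decomposition; the helper name `wind_components` and the runway heading of 40 degrees are hypothetical and are not taken from the report or from any of the three airports.

```r
# Hypothetical helper: crosswind and headwind components (same unit as
# wind_speed) for a runway whose heading is given in degrees.
wind_components <- function(wind_speed, wind_dir, runway_heading) {
  angle <- (wind_dir - runway_heading) * pi / 180   # degrees to radians
  data.frame(
    wind_dir  = wind_dir,
    crosswind = abs(wind_speed * sin(angle)),   # component across the runway
    headwind  = wind_speed * cos(angle)         # component along the runway
  )
}

# A 30 mph wind from three directions, for an assumed runway heading of 40 degrees
wind_components(wind_speed = 30, wind_dir = c(40, 70, 130), runway_heading = 40)
```

A wind aligned with the runway has no crosswind component, while the same wind at a right angle to the runway acts entirely as a crosswind, which is why wind speed and wind direction are best read together.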
data_fw %>%
filter(!is.na(wind_speed))%>%
filter(wind_speed > 0, wind_speed < quantile(wind_speed, 0.99)) %>%
group_by(wind_speed,season)%>%
summarise(count_delay = sum(count_delayed),
count = n())%>%
ggplot(aes(x=wind_speed,y=(100*(count_delay/count)))) +geom_line(stat = "identity",lwd=2,color="#00AFBB") +labs(x="Wind speed (in mph) ",y="Delay Percent (%)",title="Delay Percent (%) vs Wind speed (in mph) ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank()
)+facet_wrap(season~.)
It is known that New York experiences its windiest weather during the spring and summer months of the year.
7. Wind Direction
In the weather data set, the wind direction parameter also has null values. As with wind speed, the data cleaning process removes them with filter(!is.na(wind_dir)); they account for only 1.76% of the values in that column.
data_fw%>%
filter(!is.na(wind_dir))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(wind_dir) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =wind_dir,y=avg_dep_delay,color = avg_dep_delay) +
geom_smooth() +
labs(x="Wind direction (in degrees)", y="Average Departure delay (in minutes)", title = "Wind direction (in degrees) vs Average Departure Delay") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
The graph of average departure delay versus wind direction resembles a sinusoidal curve, with the average departure delay attaining local maxima when the wind direction is approximately 0 or 180 degrees (modulo 360), and local minima when the wind direction is approximately 90 or 270 degrees (modulo 360).
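A sinusoidal pattern with two maxima and two minima per full circle can be checked with a harmonic regression. The sketch below runs on synthetic data built to have that shape; the baseline of 33 minutes and the amplitude of 5 minutes are assumptions for illustration, not estimates from the NYC Flights data.

```r
# Synthetic data (not from the report): maxima at 0 and 180 degrees, minima at
# 90 and 270 degrees, plus a small deterministic wiggle
wind_dir <- seq(0, 350, by = 10)
theta <- wind_dir * pi / 180
avg_dep_delay <- 33 + 5 * cos(2 * theta) + 0.5 * sin(7 * theta)

# Harmonic regression with two cycles per 360 degrees
fit <- lm(avg_dep_delay ~ cos(2 * theta) + sin(2 * theta))
coefs <- unname(coef(fit))
round(coefs, 3)

# Amplitude of the fitted wave, in minutes
amplitude <- sqrt(coefs[2]^2 + coefs[3]^2)
amplitude
```

On real data, a fitted amplitude that is small relative to the scatter of the points would suggest that the apparent wave is weak evidence of a direction effect.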
Question 3 : What is the effect of airport and carrier on departure delay? Which airport and carrier are the best and the worst?
1.Analysis of Airport
Preliminary inspection involves investigating three measures as follows:
Percentage delay: The proportion of delayed flights to the total flights in each airport.
Relative percentage of flights delayed: This is the proportion of flights delayed in each airport relative to the total number of flights delayed in NYC. This plot essentially supports the percentage delay plot.
Time departure percentage: The proportion of the number of flights departing “on time” to the total number of flights in each airport.
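The three measures defined above can be sketched on a toy data set. The values are illustrative only and the column names in `airport_measures` are hypothetical; a flight counts as delayed when its departure delay is positive, as in the rest of the report.

```r
# Toy data (illustrative values only): origin and departure delay in minutes
toy <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  dep_delay = c(12, 30, -3, 5, 8, -1, -4, 0, 15, -2)
)

delayed <- toy$dep_delay > 0
total_by_origin   <- tapply(delayed, toy$origin, length)  # all flights per airport
delayed_by_origin <- tapply(delayed, toy$origin, sum)     # delayed flights per airport

airport_measures <- data.frame(
  origin          = names(total_by_origin),
  # delayed flights as a share of the airport's own flights
  pct_delay       = 100 * as.vector(delayed_by_origin / total_by_origin),
  # delayed flights as a share of all delayed flights across the airports
  rel_pct_delayed = 100 * as.vector(delayed_by_origin / sum(delayed)),
  # flights leaving early or exactly on schedule, as a share of the airport's flights
  pct_on_time     = 100 * as.vector(1 - delayed_by_origin / total_by_origin)
)
airport_measures
```

The first and third measures use each airport's own flight count as the denominator, while the second uses the number of delayed flights across all airports, so its values sum to 100.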
Before visualizing the percentage delay, the number of delayed flights per airport and the total number of flights per airport are explored, where only positive departure delays (late departures) are taken into account.
Total Number of Flights per Airport
flights%>%
filter(dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
summarise(count=n())%>%
ggplot(aes(y= count,x= reorder(origin,count),fill=count))+geom_bar(width = 0.5, stat="identity")+labs(y="Number of flights", x= "Airport", title="Airport vs Total Number of Flights Departing")+theme_bw()
Number of Delayed flights per airport
flights%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
summarise(count=n())%>%
ggplot(aes(y= count,x= reorder(origin,count),fill=count))+geom_bar(width = 0.5,stat="identity")+labs(y="Number of flights delayed", x= "Airport", title="Airport vs Number of Flights Delayed")+theme_bw()
From the above graphs, it is clear that EWR and JFK have the highest and second-highest numbers of both departing flights and delayed flights. LGA has the lowest for both.
1.1 Percentage Delay of Airports
When plotting the percentage delay, only positive departure delays (flights departing late) are taken into account.
flights<-flights%>%
mutate(count_delayed= ifelse(dep_delay>0,1,0))
tot_flights_airport<-flights%>%
filter(dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
summarise(tot_count=n())
delay_flights_airport<-flights%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
summarise(delay_count=n())
flight_per_airport<-tot_flights_airport%>%
inner_join(delay_flights_airport,by="origin")
flight_per_airport%>%
group_by(origin)%>%
summarise(delay_per= 100*(delay_count/tot_count))%>%
ggplot()+aes(y=delay_per, x=reorder(origin,delay_per),fill=origin)+geom_bar(width = 0.5,stat="identity")+labs(x="Airport", y="Proportion of Delayed Flights out of Total Flights (%)", title="Percentage Delay of Flights Per Origin")+theme_bw()
From the above graph, it is clear that EWR has the highest proportion of delayed flights relative to its total flights, making it the worst performing airport in that regard. LGA, on the other hand, is the best performing airport.
1.2 Relative Percentage of Flights Delayed
flights%>%
filter(dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
summarise(rel_per= 100*n()/nrow(flights))%>%
ggplot()+aes(y=rel_per, x= reorder(origin,rel_per),fill=rel_per)+geom_bar(width = 0.5,stat="identity")+labs(y="Relative Percentage of Flights (%)", x="Airport",title="Percentage of Flights Delayed Relative to Total Number of Flights" )+theme_bw()
From the above plot, it can be seen that EWR has the highest percentage of delayed flights (35.39%) and LGA has the lowest percentage (30.57%).
1.3 Time Departure Percentage per Airport
When plotting the time departure percentage, the departure delay is split into two categories, ‘On Time’ and ‘Delayed’. ‘On Time’ covers flights with a negative (early departure) or zero departure delay. ‘Delayed’ covers all remaining flights (late departures).
flights_trd<-flights%>%
mutate(dep_category= ifelse(dep_delay <= 0,"on time", "delayed"))
flights_trd%>%
filter(dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
ggplot()+aes(x=origin,fill=dep_category)+geom_bar(width = 0.5)+labs(x="Airports",y="Number of Flights (Delayed/On Time)", title="Count of Flights 'Delayed' and 'On Time' Per Airport")+theme_bw()
flights_trd%>%
filter(dep_delay < quantile(dep_delay, 0.99))%>%
group_by(origin)%>%
summarise(per_dep=100*sum(dep_category=="on time")/n())%>%
arrange(desc(per_dep))%>%
ggplot()+aes(x=origin,y=per_dep, fill=origin)+geom_bar(width = 0.5,stat="identity")+labs(x="Airport",y="Percentage of time departure(%)",title="Time Departure Percentage Per Airport")+theme_bw()
From the plot of the time departure percentage, it is clear that LGA has the highest proportion of its flights departing on time (67.58%) and is hence the best performing airport in terms of time departure percentage. EWR is the worst performing airport (time departure percentage: 55.85%).
In conclusion, purely on the basis of the plots above, EWR is the worst performing airport and LGA is the best performing. Considering the total number of flights, LGA has the lowest number of departing flights and EWR the highest. Holistically, this may be because EWR and JFK are primarily international airports, bringing a larger number of people, possibly longer boarding times, more security checks and more flights than LGA, which is primarily a domestic airport.
1.4 Effect of Airport on Average Departure Delay
flights%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(origin) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =origin,y=avg_dep_delay,fill = avg_dep_delay) +geom_col(width = 0.5)+labs(x="Origin/Airport", y="Average Departure delay (in minutes)", title = "Origin vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Here, LGA has the highest average departure delay, even though it had the lowest proportion of delayed flights. JFK has the lowest average departure delay. This could be because LGA is primarily a domestic airport with far fewer resources than the international airport JFK.
To investigate why LGA has the highest average departure delay, its departure delay is analysed over the span of the year.
flights %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
mutate(date = ymd(paste(year, month, day))) %>%
group_by(date,origin)%>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+aes(x =date, y = avg_dep_delay, group=origin, color=origin) +
geom_smooth()+labs(x="Duration", y="Average Departure delay (in minutes)", title = "Trend of Average Departure Delay by Duration for Origin")+
theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
From the above plot, it is clear that between the summer and winter seasons, LGA has a higher average departure delay than the other two airports. This may be due to an insufficient workforce to deal with the rush during these peak seasons, when travel is most likely.
2. Analysis of Carrier
Again, preliminary analysis involves investigating the percentage delay for the carrier and relative percentage of flights departing per carrier.
In this case, the relative percentage of flights departing is the proportion of a carrier's flights relative to the total number of flights departing from NYC. The percentage delay is the proportion of delayed flights to the total flights of each carrier.
Before that, the number of flights per carrier is explored.
flights%>%
filter(dep_delay < quantile(dep_delay, 0.99))%>%
group_by(carrier)%>%
summarise(count=n())%>%
ggplot(aes(y= count,x= reorder(carrier,count),fill=count))+geom_bar( stat="identity")+labs(y="Total Number of Flights ", x= "Carriers", title="Total Number of Flights per Carrier")+theme_bw()
flights%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(carrier)%>%
summarise(count=n())%>%
ggplot(aes(y= count,x= reorder(carrier,count),fill=count))+geom_bar( stat="identity")+labs(y="Number of Flights Delayed", x= "Carriers", title="Carriers vs Number of Flights Departed Late")+theme_bw()
From the above plot, it is clear that UA is the carrier with the highest number of delayed flights, followed by EV, B6 and DL.
The order is similar for the total number of flights, with UA having the highest total, followed by B6, EV and then DL.
Exploring the percentage delay and relative percentage
As in the analysis of airports, only positive departure delays (flights departing late) are taken into account when plotting the percentage delay.
2.1 Percentage Delay of Carriers
flights<-flights%>%
mutate(count_delayed= ifelse(dep_delay>0,1,0))
flights%>%
filter( dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(carrier)%>%
summarise(prop_delay=100 * mean(count_delayed))%>%
ggplot()+aes(y=prop_delay, x= reorder(carrier,prop_delay),fill=prop_delay)+geom_bar(stat="identity")+labs(x="Carriers", y=" Proportion of Delayed Flights out of Total Flights (%)", title="Percentage Delay of Flights Per Carrier")+theme_bw()
From the above graph, it’s clear that carrier WN has the highest
proportion of delayed flights of its total flights, being the worst
performing carrier in that regard. WN is followed by FL,F9, UA and EV.
On the other hand, HA is the best performing carrier.
flights%>%
filter(dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(carrier,origin)%>%
summarise(prop_delay=100 * mean(count_delayed))%>%
ggplot()+aes(x=prop_delay, y= reorder(carrier,prop_delay),fill=prop_delay)+geom_bar(stat="identity")+labs(y="Carriers", x=" Proportion of Delayed Flights out of Total Flights (%)", title="Percentage Delay of Flights Per Carrier")+theme_bw()+facet_wrap(~origin)
From the above, a few things can be inferred. First, the worst performing carrier in this regard, WN, unsurprisingly has the highest percentage delay at two of the three airports (EWR and LGA). There are also cases where certain carriers have no flights at certain airports. Take the carriers FL, F9 and YV: there seem to be no flights from the international airports EWR and JFK. This may be because those carriers are primarily focused on domestic services. However, although they only operate out of LGA, they have a relatively high percentage delay, most probably because they are smaller operations with limited resources.
2.2 Relative Percentage of Flights Delayed
flights%>%
filter(dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(carrier)%>%
summarise(rel_per= 100*n()/nrow(flights))%>%
ggplot()+aes(y=rel_per, x= reorder(carrier,rel_per),fill=rel_per)+geom_bar(stat="identity")+labs(y="Percentage of Flights (%)", x="Carriers",title="Percentage of Flights Delayed Relative to Total Number of Flights" )
Unsurprisingly, UA, the carrier with the highest number of total flights and total delayed flights, contributes most to the relative percentage of flights delayed. However, considering the smaller size of WN and its having the highest percentage delay per carrier, WN is a strong candidate for the worst performing carrier. EV is another possible candidate, along with B6.
Finding the best performing carrier, like finding the worst one, is not straightforward and requires additional analysis, namely looking into the average departure delay per carrier.
2.3 Effect of Carrier on Average Departure Delay
tot_avg_dep_delay<-flights%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
summarise(avg_dep_delay=mean(dep_delay,na.rm=TRUE))
flights%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
group_by(carrier)%>%
summarise(avg_dep_delay=mean(dep_delay,na.rm=TRUE))%>%
ggplot()+geom_bar(aes(x= avg_dep_delay, y=reorder(carrier,avg_dep_delay)),fill="lightblue", width=0.5, stat="identity")+labs(x="Average Departure Delay (in mins)",y="Carriers", title= "Average Departure Delay vs Carrier")+theme_bw()+geom_vline(aes(xintercept=mean(tot_avg_dep_delay$avg_dep_delay), linetype= "Average Departure Delay in New York"),color='red')
In the above plot, although OO and YV seem to have the highest average departure delay outright, that result is deceptive, mainly because OO and YV have a very small number of departing flights. For a less biased conclusion, the analysis of departure delay is therefore focused only on carriers with a substantial number of departing flights. Considering EV's high number of departing flights and the fact that it has the third highest average departure delay, it can be concluded to be the worst carrier of the bunch.
Considering that UA has the highest number of departing flights and the lowest average departure delay among carriers with a similarly high number of departing flights, UA shows a lot of promise as potentially the best carrier.
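The restriction to carriers with a substantial number of flights could be made explicit with a minimum-count filter. The sketch below is an illustration under assumptions: `toy_delays` is a hypothetical made-up table and the cutoff `min_flights` is an arbitrary value chosen for the example, not a threshold used in this report.

```r
library(dplyr)

# Hypothetical delayed-flight records (values are made up)
toy_delays <- tibble(
  carrier   = c("UA", "UA", "UA", "UA", "EV", "EV", "EV", "OO"),
  dep_delay = c(5, 10, 15, 10, 30, 40, 50, 120)
)

min_flights <- 3  # hypothetical cutoff; choose according to the real data

carrier_summary <- toy_delays %>%
  group_by(carrier) %>%
  summarise(n_flights = n(), avg_dep_delay = mean(dep_delay)) %>%
  filter(n_flights >= min_flights) %>%  # drop carriers with too few flights
  arrange(desc(avg_dep_delay))

print(carrier_summary)
```

In this toy example OO, with a single very late flight, is excluded, so it no longer dominates the ranking.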
To solidify that notion, a calendar-style plot showing the departure delay per hour per month is created for a select few relevant carriers, to check for jarring variations in any period (season or month) in which UA would not be suitable.
2.4 Calendar Diagram for Select Carriers
flights %>% group_by(carrier,month) %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
filter(carrier %in% c('UA','B6','EV','DL','AA','MQ','US','9E'))%>%
ggplot() +geom_tile(aes(x = hour, y = carrier, fill = dep_delay), color = 'black')+scale_fill_distiller(palette ='Spectral')+facet_wrap(~month)
Having investigated the performance of UA in terms of departure delay over the twelve months, no jarring seasonal issues were observed, apart from occasional spikes. Overall, UA remained relatively consistent in terms of departure delay over the twelve-month period, supporting the earlier notion that UA is the best performing carrier.
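A tile plot draws one tile per row, so when several flights share the same carrier, month and hour, the tiles are drawn on top of each other. One possible variant, sketched below on a hypothetical made-up table `toy_tiles`, averages the delay per cell first so that each tile shows a single value. This is an alternative illustration, not the code used for the plot above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical delayed-flight records (values are made up)
toy_tiles <- tibble(
  carrier   = c("UA", "UA", "UA", "EV", "EV"),
  month     = c(1, 1, 1, 1, 1),
  hour      = c(6, 6, 7, 6, 6),
  dep_delay = c(10, 20, 5, 30, 50)
)

# One row per carrier/month/hour cell, holding the mean delay
tile_data <- toy_tiles %>%
  group_by(carrier, month, hour) %>%
  summarise(avg_dep_delay = mean(dep_delay), .groups = "drop")

p <- ggplot(tile_data) +
  geom_tile(aes(x = hour, y = carrier, fill = avg_dep_delay), color = "black") +
  scale_fill_distiller(palette = "Spectral") +
  facet_wrap(~month)
```

Printing `p` renders the heat map, and the fill then represents the average delay for each hour rather than an arbitrary individual flight.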
Question 4 - What is the impact of plane manufacturer and structure of the aircraft on departure delays?
Setting up global variable for planes and airports
data_fp <- flights %>%
inner_join(planes, by = c("tailnum"))%>%
mutate(count_delayed = ifelse(dep_delay > 0, 1, 0)) %>%
mutate(season = ifelse(month %in% 9:11, "Fall",
ifelse(month %in% 6:8, "Summer",
ifelse(month %in% 3:5, "Spring","Winter"))))%>%
mutate(age_of_plane = 2013-year.y) %>%
mutate(date = ymd(paste(year.x, month, day))) %>%
mutate(date1 = ymd(paste(year.y, month, day)))
head(data_fp)
## ID year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 1 2013 1 1 517 515 2 830 819
## 2 2 2013 1 1 533 529 4 850 830
## 3 3 2013 1 1 542 540 2 923 850
## 4 4 2013 1 1 544 545 -1 1004 1022
## 5 5 2013 1 1 554 600 -6 812 837
## 6 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour count_delayed year.y type
## 1 2013-01-01T10:00:00Z 1 1999 Fixed wing multi engine
## 2 2013-01-01T10:00:00Z 1 1998 Fixed wing multi engine
## 3 2013-01-01T10:00:00Z 1 1990 Fixed wing multi engine
## 4 2013-01-01T10:00:00Z 0 2012 Fixed wing multi engine
## 5 2013-01-01T11:00:00Z 0 1991 Fixed wing multi engine
## 6 2013-01-01T10:00:00Z 0 2012 Fixed wing multi engine
## manufacturer model engines seats speed engine season age_of_plane
## 1 BOEING 737-824 2 149 NA Turbo-fan Winter 14
## 2 BOEING 737-824 2 149 NA Turbo-fan Winter 15
## 3 BOEING 757-223 2 178 NA Turbo-fan Winter 23
## 4 AIRBUS A320-232 2 200 NA Turbo-fan Winter 1
## 5 BOEING 757-232 2 178 NA Turbo-fan Winter 22
## 6 BOEING 737-924ER 2 191 NA Turbo-fan Winter 1
## date date1
## 1 2013-01-01 1999-01-01
## 2 2013-01-01 1998-01-01
## 3 2013-01-01 1990-01-01
## 4 2013-01-01 2012-01-01
## 5 2013-01-01 1991-01-01
## 6 2013-01-01 2012-01-01
data_fa <- airports %>%
inner_join(flights, c("faa" = "dest"))%>%
mutate(count_delayed = ifelse(dep_delay > 0, 1, 0))
head(data_fa)
## faa name lat lon alt tz dst
## 1 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 2 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 3 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 4 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 5 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## 6 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7 A
## tzone ID year month day dep_time sched_dep_time dep_delay
## 1 America/Denver 27276 2013 10 1 1955 2001 -6
## 2 America/Denver 28256 2013 10 2 2010 2001 9
## 3 America/Denver 29216 2013 10 3 1955 2001 -6
## 4 America/Denver 30230 2013 10 4 2017 2001 16
## 5 America/Denver 30954 2013 10 5 1959 1959 0
## 6 America/Denver 31811 2013 10 6 1959 2001 -2
## arr_time sched_arr_time arr_delay carrier flight tailnum origin air_time
## 1 2213 2248 -35 B6 65 N554JB JFK 230
## 2 2230 2248 -18 B6 65 N607JB JFK 238
## 3 2232 2248 -16 B6 65 N591JB JFK 251
## 4 2304 2248 16 B6 65 N662JB JFK 257
## 5 2226 2246 -20 B6 65 N580JB JFK 242
## 6 2234 2248 -14 B6 65 N507JB JFK 240
## distance hour minute time_hour count_delayed
## 1 1826 20 1 2013-10-02T00:00:00Z 0
## 2 1826 20 1 2013-10-03T00:00:00Z 1
## 3 1826 20 1 2013-10-04T00:00:00Z 0
## 4 1826 20 1 2013-10-05T00:00:00Z 1
## 5 1826 19 59 2013-10-05T23:00:00Z 0
## 6 1826 20 1 2013-10-07T00:00:00Z 0
1. Analysis of Manufacturer
1.1 Analysis of Manufacturer with respect to number of delayed flights
To understand manufacturer performance, let us first explore the number of delayed flights for each manufacturer. Below we see that ‘BOEING’, ‘EMBRAER’, ‘AIRBUS’, ‘AIRBUS INDUSTRIE’, ‘BOMBARDIER INC’, ‘MCDONNELL DOUGLAS AIRCRAFT CO’ and ‘CANADAIR’ are at the top of the list. Hence, this list is carried forward for analysis.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(manufacturer) %>%
summarise(count=n())%>%
ggplot(aes(y=reorder(manufacturer,count),x=count))+
geom_bar(stat = 'identity',fill="steelblue")+
labs(y="Manufacturer",x="Number of departure delays",title="Number of departure delays in each Manufacturer")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
panel.background = element_blank()
)
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','MCDONNELL DOUGLAS AIRCRAFT CO','CANADAIR'))%>%
group_by(manufacturer)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(y=reorder(manufacturer,avg_dep_delay),x=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_vline(aes(xintercept=33.44),linetype = 2)+labs(y="Manufacturer",x="Average Departure Delay (in minutes)",title="Average departure delay for each Manufacturer ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Here, the graph shows that although Boeing has the highest number of delayed flights, its average departure delay is below the overall mean departure delay (dashed line). Canadair has the highest average departure delay.
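The reference line in the plot uses a hard-coded value of 33.44. A small sketch of how such a reference value could be derived from the data itself is given below. It uses a hypothetical made-up table `toy_fp` in place of `data_fp`, and the names `overall_avg` and `above_overall` are introduced only for this illustration.

```r
library(dplyr)

# Hypothetical delayed-flight records (values are made up)
toy_fp <- tibble(
  manufacturer = c("BOEING", "BOEING", "EMBRAER", "EMBRAER", "CANADAIR"),
  dep_delay    = c(10, 20, 40, 50, 80)
)

# Overall mean delay, computed once from the same filtered data
overall_avg <- mean(toy_fp$dep_delay)

by_manufacturer <- toy_fp %>%
  group_by(manufacturer) %>%
  summarise(avg_dep_delay = mean(dep_delay)) %>%
  mutate(above_overall = avg_dep_delay > overall_avg)  # flag manufacturers above the mean

print(by_manufacturer)
```

The computed `overall_avg` could then be passed to `geom_vline(xintercept = overall_avg)`, so the line stays correct if the filtering of the data changes.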
Investigating the trend further for the three manufacturers with the highest average departure delay (‘EMBRAER’, ‘BOMBARDIER INC’, ‘CANADAIR’), we infer that the average departure delay for Canadair increases from July. Bombardier and Embraer show a decreasing trend in average departure delay from July, followed by an increase in December, i.e. the holiday season.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
filter(manufacturer %in% c('EMBRAER','BOMBARDIER INC','CANADAIR'))%>%
group_by(date,manufacturer)%>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+aes(x =date, y = avg_dep_delay, group=manufacturer, color=manufacturer) +
geom_smooth()+labs(x="year", y="Average Departure delay (in minutes)", title = " Trend of Average Departure Delay for 'EMBRAER','BOMBARDIER INC','CANADAIR' ")+
theme(plot.title = element_text(hjust = 0.15),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
1.2 Analysis of Manufacturer with respect to average departure delay
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(manufacturer)%>%
summarise(avg_dep_delay =
mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x=reorder(manufacturer,avg_dep_delay), y=avg_dep_delay,fill=avg_dep_delay)+
geom_bar(stat="identity") +
labs(
title = "Average Departure Delays for Different Manufacturers",
x = "Manufacturer",
y = "Average Delay (mins)",
) + ylim(0,90) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Aircraft manufactured by AVIAT AIRCRAFT INC show considerably higher delay times. Since no pattern is observable in these graphs, we plot the same graph for each carrier to look for carrier-level patterns.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(manufacturer,carrier)%>%
summarise(avg_dep_delay =
mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x=reorder(manufacturer,avg_dep_delay), y=avg_dep_delay,fill=avg_dep_delay)+
geom_bar(stat="identity") +
labs(
title = "Average Departure Delays for Different Manufacturers",
x = "Manufacturer",
y = "Average Delay (mins)",
) + ylim(0,90) +facet_wrap(carrier~.)+
theme_bw()
From the above graphs, we can observe that American Airlines Inc. (AA) has the highest delay times of all carriers. This indicates that the delay may be caused not by the manufacturer but by the operations of the airlines.
2. Analysis of structure of the aircraft with respect to capacity (seats), engine, engine type
2.1 Aircraft Capacity or Seat
data_fp%>%
filter(!is.na(seats))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(seats) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =seats,y=avg_dep_delay,color = avg_dep_delay) +geom_point()+
geom_smooth()+labs(x=" Seat", y="Average Departure delay (in minutes)", title = " Seat vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
We see that the average departure delay increases as the number of seats increases, specifically for very large aircraft. This delay may arise from aircraft cleaning, baggage loading and fueling. Large flights also result in long lines for passengers, which increases departure delay.
One other point that can be deduced from this graph is that the average departure delay is lowest at approximately 250 seats.
data_fp%>%
filter(!is.na(seats))%>%
group_by(seats_group=cut(seats,breaks= seq(0,450, by =75)),origin) %>%
summarise(Total_count=n())%>%
ggplot(aes(x = Total_count, y = reorder(factor(seats_group),
Total_count),fill=seats_group)) +
geom_bar(width=0.7, stat = "identity") +
theme_bw(base_line_size = 0, base_size = 9) +labs(x="Number of Flights",
y="Seat Number Group",title="Number of flights vs Seat per Origin")+facet_wrap(origin~.)
With respect to origin, we observe that JFK has the highest number of flights on large aircraft, followed by EWR. Our analysis found that the proportion of delayed flights is higher at EWR and JFK. This could be because LGA is primarily a domestic airport, with fewer large aircraft than the international airports EWR and JFK.
Looking at the manufacturers, we see that Boeing, Airbus and Airbus Industrie produce large aircraft with seats ranging from 150 to 225. Additionally, Embraer falls into only one seat group, that of up to 75 seats.
data_fp%>%
filter(!is.na(seats))%>%
filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','MCDONNELL DOUGLAS AIRCRAFT CO','CANADAIR'))%>%
group_by(seats_group=cut(seats,breaks= seq(0,450, by =75)),manufacturer) %>%
summarise(Total_count=n())%>%
ggplot(aes(x = Total_count, y = reorder(factor(seats_group),
Total_count),fill=seats_group)) +
geom_bar(width=0.7, stat = "identity") +
theme_bw(base_line_size = 0, base_size = 9) +labs(x="Number of Flights",
y="Seat Number Group",title="Number of flights vs Seat per Manufacturer ")+facet_wrap(manufacturer~.)
2.2 Engine Type and Number of Engines
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(engine)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(x=engine,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Engine Type",y="Average Departure Delay (in minutes)",title="Average departure delay for each engine type ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
4 Cycle engines show the highest average departure delay. However, Turbo-fan engines have the highest number of delayed flights.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(engine,carrier)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(x=engine,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Engine Type",y="Average Departure Delay (in minutes)",title="Average departure delay for each engine type ")+theme_bw()+facet_wrap(carrier~.)
American Airlines Inc. (AA) potentially has the highest number of 4 Cycle engines, which would cause the average delay time for American Airlines to be higher than that of other carriers.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(engines)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(x=engines,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Number of Engines ",y="Average Departure Delay (in minutes)",title="Average departure delay vs Number of engines ")+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
Aircraft with 3 engines usually run early. However, all other engine counts show high delay times. As the number of engines increases from 1 to 2 to 4, the delay time also increases.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(engine,engines) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x =engines, y = avg_dep_delay, group=engine, color=engine)) + scale_x_continuous(breaks = c(1,2,3,4))+
geom_line(lwd = 2) +labs(x="Number of Engines", y="Average Departure delay (in minutes)", title = " Trend of Average Departure Delay vs Number of engines for each engine type ")+geom_hline(aes(yintercept=33.443),linetype = 2)+
theme_bw()
The graph above demonstrates that the departure delay decreases as the number of engines increases for reciprocating and turbo-fan engines. However, the same cannot be said for 4 Cycle engines and turbo-jets.
2.3 Age of the aircraft
data_fp %>%
filter(!is.na(age_of_plane ))%>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(age_of_plane) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+ aes(x =age_of_plane,y=avg_dep_delay,color = avg_dep_delay) + geom_point()+
geom_smooth(method="lm")+labs(x="Age of the plane in years", y="Average Departure delay (in minutes)", title = "Average Departure Delay vs Age of the plane")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
strip.text.x=element_text(size=8),
strip.background = element_blank(),
panel.background = element_blank())
As the age of the aircraft does not correlate well with the average departure delay, we analyze it further by manufacturer, using the manufacturers with the greatest number of delayed flights.
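The strength of the relationship between aircraft age and average delay can also be quantified rather than judged from the plot alone. The sketch below computes a Pearson correlation on a hypothetical made-up table `toy_age`, so the resulting value says nothing about the real data and is only meant to show the calculation.

```r
library(dplyr)

# Hypothetical delayed-flight records (values are made up)
toy_age <- tibble(
  age_of_plane = c(1, 1, 5, 5, 10, 10),
  dep_delay    = c(20, 30, 25, 35, 30, 40)
)

# Average delay per age, mirroring the grouped summary used for the plot
age_summary <- toy_age %>%
  group_by(age_of_plane) %>%
  summarise(avg_dep_delay = mean(dep_delay))

# Pearson correlation between age and average delay
r <- cor(age_summary$age_of_plane, age_summary$avg_dep_delay)
print(r)
```

A value of `r` near zero on the real data would support the statement that age alone explains little of the average delay.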
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','CANADAIR'))%>%
group_by(age_of_plane,manufacturer)%>%
summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(x=age_of_plane,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity")+scale_x_continuous(breaks = c(5,10,15,20,25,30), limits = c(0, 30))+geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Age of the plane",y="Average Departure Delay (in minutes)",title="Average Departure Delay vs Age of the plane per manufacturer")+theme_bw()+facet_wrap(manufacturer~.)
It is evident that for airlines using Boeing aircraft, the average departure delay is lower than the overall average departure delay at all aircraft ages. This suggests that Boeing aircraft are well maintained and have good availability of service and parts. Embraer, in contrast, has a high departure delay across all aircraft ages.
data_fp %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(year.y)%>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+
aes(x=year.y, y=avg_dep_delay)+
geom_line(lwd=2,colour="steelblue") +
labs(
title = "Average Departure Delay Time v/s Year of Manufacture",
x = "Year of Manufacture",
y = "Average Departure Delay (mins)",
)+
theme_bw()
From this plot, we can infer that there is no linear correlation between year of manufacture and delay time. However, every few years the delay time rises and then falls again. This may be due to technological advances made in the industry.
Question 5 : Is there a pattern to the departure delay in terms of geography of our analysis?
1. Timezones of the airports
world <- map_data("world")
ggplot() +
geom_map(data = world, map = world,aes(x=long,y=lat,map_id = region),color = "black", fill = "lightgray")+
geom_point(data=airports,mapping=aes(x=lon,y=lat,col=tzone))+
labs(
title="Timezones of Airports",
x = "Longitude",
y = "Latitude"
)
The United States is spread across six time zones. From west to east, they are Hawaii, Alaska, Pacific, Mountain, Central, and Eastern.
2. Average Departure Delay in each destination airport
Delayed flights are concentrated at destination airports in the eastern region of the USA.
The eastern seaboard contains states such as Massachusetts, New York, New Jersey, Virginia, North Carolina, South Carolina, Georgia, and Florida. These are major and popular states of the USA.
data_fa %>%
filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
group_by(faa, lat, lon) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE),Number_of_delayed_flight_in_1000 = n()/1000) %>%
ggplot() +
aes(x = lon,
y = lat,
color = avg_dep_delay,
size = Number_of_delayed_flight_in_1000
) +
geom_point() +borders("state") +
labs(
title = "Average Departure Delays From NYC by Destination",
x = "Longitude",
y = "Latitude"
)
Recommendations
This section contains the recommendations based on the exploratory analysis of departure delay.
Question - Is there a pattern to the departure delay in terms of time? (Month, Day of week and Hour)
Average departure delay by hour
The graph below shows that during the day, the average departure delay is considerably lower in the early hours compared to the latter part of the day. Following the data in the graph, the preferred time to fly is between 5 a.m. and 8 a.m. to avoid delays since the average departure delay is approximately 20 minutes. So, it’s recommended that airports make use of this situation by scheduling more flights in the early hours to reduce the stress during the rest of the day. This would require a thorough analysis and revamping of the airport flight slot scheduling system already in place.
image:
Number of delayed flights by month
From the graph below, it is evident that the summer and winter seasons, which correspond to the holiday seasons, contribute most to the number of delayed flights. The summer season would attract many international tourists, and winter would bring many domestic tourists visiting family and friends. The high volume of passengers would lead to longer queues at flight check-in counters, which often causes flight departure delays. To mitigate this, the airport should increase the workforce available and open additional counters for security checks and flight check-in.
Since this surge in passengers during the holiday seasons is a temporary spike relative to the entire year, another recommendation is to open temporary airstrips on areas that can easily be converted back to productive land, used for agriculture, or fitted with solar energy infrastructure after the peak passes. Along the same lines, during this period, a further recommendation is to maintain a separate landing strip for private planes so that their slots can be allocated to other public planes.
image:
Average departure delay vs. week
From the graph below, it is clear that Tuesday and Wednesday show a lower average departure delay than the mean value. This could be because flying mid-week typically requires time off work, which typical working professionals may find difficult. Consequently, the airports are relatively quieter, resulting in fewer delays. To optimize the airport’s operational cost, it is advisable to rework the staff scheduling strategy to reflect the lower operational demands.
image:
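The weekday comparison described above can be sketched as follows. The table `toy_week` is a hypothetical made-up example, and the numeric weekday coding (1 = Monday to 7 = Sunday) is a choice made here for illustration via `week_start = 1`.

```r
library(dplyr)
library(lubridate)

# Hypothetical delayed-flight records (values are made up)
toy_week <- tibble(
  year      = 2013,
  month     = 1,
  day       = c(1, 2, 5, 8),
  dep_delay = c(10, 20, 60, 30)
)

overall_mean <- mean(toy_week$dep_delay)

weekday_summary <- toy_week %>%
  mutate(
    date    = make_date(year, month, day),
    weekday = wday(date, week_start = 1)  # 1 = Monday, ..., 7 = Sunday
  ) %>%
  group_by(weekday) %>%
  summarise(avg_dep_delay = mean(dep_delay)) %>%
  mutate(below_mean = avg_dep_delay < overall_mean)  # days quieter than average

print(weekday_summary)
```

Days flagged in `below_mean` are the candidates for a reduced staffing schedule.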
Question - How does weather impact flights from NYC? What is the effect of weather on departure delay?
Weather
From the investigation of the impact of different weather parameters on the departure delay, it is evident that adverse weather conditions (low visibility, high precipitation, high relative humidity, etc.) significantly increase the departure delay. One recommendation to improve departure delay during adverse weather conditions is to invest in and enhance the take-off strip lighting system at the airports (EWR, JFK, LGA), helping pilots manage these situations effectively. It is also advisable for the airport authorities to have specialist staff in place during adverse conditions to promptly remove debris and other objects from the runway. There is also the possibility of emergency flights landing at the airports in question, which can cause unexpected departure delays for flights about to take off. To mitigate this, it is advised to have proper communication systems in place to communicate seamlessly with air traffic control.
Question - What is the effect of airport and carrier on departure delay? Which airport and carrier are the best and the worst?
Analysis of Airport
Airport vs Average Departure Delay
From the graph below, it’s clear that of the three airports, LGA has the highest average departure delay, and JFK has the lowest. As mentioned earlier, this is probably because LGA is a domestic airport and is smaller than EWR and JFK, which are international airports with large capacities. In general, smaller airports have limited infrastructure, facilities, and operational resources. These limits affect not just the flight schedule but also the number of flights that can be accommodated at the airport at the same time, which can lead to significant departure delays that then propagate to other flights. To reduce the departure delay at LGA, it’s advisable to optimize the airport ground usage to maximize the area available so that the airport can accommodate a larger number of flights. Another suggestion is to reduce the buffer time between consecutive take-offs to prevent departure delay propagation.
image:
Airport vs Percentage Delay of Flights
From the graph below, it’s clear that EWR’s high percentage of delayed flights is another issue, because EWR also has the highest number of departing flights. That implies the delays affect a larger number of people. A few recommendations to reduce the delays in flight departure at EWR are: increasing the number of security check and flight check-in counters, investing more in optimizing the logistics of airport operation, and ensuring proper maintenance of core elements of airport operation such as runways and taxiways. The final recommendation is to invest in airport monitoring technology. With AI, existing security and observation cameras at airports can automatically detect delays in turnaround services. This alerts airport staff or ground crew members to the issue in real time so that they can formulate a mitigation strategy.
image:
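The two airport measures discussed above (average departure delay and percentage of delayed flights) can be computed as in the following sketch. It is an illustration under stated assumptions: the sample rows are invented to mirror the pattern described in the text, the function name airport_delay_metrics is hypothetical, and a flight is treated as delayed when its departure delay is greater than zero, following Assumption 3.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (origin airport, dep_delay in minutes).
# The values are illustrative only and are not taken from the NYC Flights data.
flights = [
    ("EWR", 12.0), ("EWR", -2.0), ("EWR", 30.0), ("EWR", 8.0),
    ("JFK", 10.0), ("JFK", -5.0), ("JFK", 0.0),
    ("LGA", 40.0), ("LGA", -1.0),
]


def airport_delay_metrics(rows):
    """Return {origin: {"flights", "pct_delayed", "avg_delay"}}."""
    groups = defaultdict(list)
    for origin, dep_delay in rows:
        # Flights with no recorded delay (e.g. cancellations) are skipped.
        if dep_delay is not None:
            groups[origin].append(dep_delay)
    metrics = {}
    for origin, delays in groups.items():
        late = [d for d in delays if d > 0]
        metrics[origin] = {
            "flights": len(delays),
            # Share of departed flights that left late, in percent.
            "pct_delayed": 100.0 * len(late) / len(delays),
            # Average over delayed flights only (Assumption 3).
            "avg_delay": mean(late) if late else 0.0,
        }
    return metrics


metrics = airport_delay_metrics(flights)
for origin, m in sorted(metrics.items()):
    print(origin, m["flights"], round(m["pct_delayed"], 1), round(m["avg_delay"], 1))
```

Keeping the two measures separate matters for the recommendations: in this made-up sample, one airport has the highest average delay while a different airport has the highest share of delayed flights and the most departures.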
Analysis of Carrier
Considering the number of delayed flights, the percentage of delayed flights, and the average departure delay, EV was the worst-performing carrier. One recommendation to reduce the departure delay is to optimize the carrier’s security, check-in, and other operational policies to maximize the efficiency of its operation. It’s also advisable for the airline to revamp its aircraft maintenance strategy to prevent possible issues leading up to take-off. The carrier should also ensure that the pilots it employs have the experience and know-how to deal with difficult situations. One final recommendation is for the carrier to make virtual flight check-in compulsory for all passengers via an app that allows users to upload their digital certificates, eliminating paperwork and allowing smartphones to serve as their digital ID.
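One simple way to combine the three carrier measures into a single ordering is a rank-sum, sketched below. This is only an assumption about how such measures could be combined; the report does not state a combination rule. The figures in carrier_stats are invented for illustration, and the function name rank_sum is hypothetical.

```python
# Hypothetical per-carrier summary:
# carrier code -> (delayed_flights, pct_delayed, avg_delay_minutes).
# The numbers are illustrative only and are not taken from the NYC Flights data.
carrier_stats = {
    "EV": (900, 48.0, 42.0),
    "UA": (800, 45.0, 30.0),
    "DL": (500, 32.0, 28.0),
    "B6": (600, 40.0, 35.0),
}


def rank_sum(stats):
    """Return carriers ordered worst-first by the sum of their per-metric ranks."""
    carriers = list(stats)
    totals = {c: 0 for c in carriers}
    n_metrics = len(next(iter(stats.values())))
    for i in range(n_metrics):
        # Rank 1 goes to the highest (worst) value on this metric.
        ordered = sorted(carriers, key=lambda c: stats[c][i], reverse=True)
        for rank, c in enumerate(ordered, start=1):
            totals[c] += rank
    # The lowest total rank is the worst carrier overall.
    return sorted(carriers, key=lambda c: totals[c])


worst_first = rank_sum(carrier_stats)
print(worst_first)  # ['EV', 'UA', 'B6', 'DL']
```

A rank-sum treats the three measures as equally important and does not handle ties specially; a weighted score would be an alternative if one measure matters more to the Port Authority.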
Question - What is the impact of plane manufacturer and structure of the aircraft on departure delays?
Effect of manufacturer and structure of the plane
From the analysis carried out earlier, it’s clear that the manufacturer has an impact on the flight departure delay. One suggestion is for airport authorities to set stringent quality guidelines for aircraft, ensuring that each manufacturer’s aircraft meet industry standards and compliance requirements, and blocking aircraft from manufacturers that don’t meet the desired quality. This would help flights depart on time without technical issues related to the manufacturer. Manufacturers can also help reduce departure delays by innovating and improving the on-board flight instrumentation and sensors so that they function properly even in adverse weather conditions.
In terms of the plane’s structure, the number of engines, and the type of engine, small planes, especially light planes, aren’t the most practical choice in strong winds or heavy rain. Turbulence is more likely to affect smaller and lighter planes. Larger aircraft, like commercial jets, have multiple engines and are generally bigger and more robust than the more compact ones. They can withstand strong winds and heavy rain much more easily. The fact that Boeing builds large commercial aircraft (inferred from the earlier investigation) could be the reason why the average departure delay of Boeing aircraft is lower than that of the others, regardless of age.
Conclusion
In conclusion, having been approached by the Port Authority of New York and New Jersey (PANYNJ) to find possible issues and corresponding solutions related to departure delay, we explored the NYC data sets provided and formulated exploratory questions aimed at reducing the departure delay. We then analyzed these questions using exploratory and visual analytical techniques to gain insights. Finally, based on the insights derived, we provided recommendations in the context of each exploratory question. A Tableau dashboard was also created to complement the analysis.
References
Aswesawit, 2022. How to Avoid Flight Delays: 7 Tips for Travelers [Online]. Available from: https://www.aswesawit.com/how-to-avoid-flight-delays/ [Accessed December 07, 2022].
Sherburnaeroclub, 2022. Flying in bad weather [Online]. Available from: https://www.sherburnaeroclub.com/blog/flying-in-bad-weather#factors-that-affect-aircraft-safety-in-bad-weather/ [Accessed December 07, 2022].
BBC, 2022. The airport tech helping to prevent delayed flights [Online]. Available from: https://www.bbc.co.uk/news/business-60228430/ [Accessed December 07, 2022].
Contribution List
Question 1: Aravind Gopakumar, Umapujitha Singh
Question 2: Umapujitha Singh
Question 3: Aravind Gopakumar
Question 4: Piyush Jain
Question 5: Umapujitha Singh
Tableau: Aravind Gopakumar, Umapujitha Singh
Report Making: Aravind Gopakumar, Piyush Jain, Umapujitha Singh